BioEnEx (Bio-entity Extractor)
This page provides the source code (see the bottom of this page) and usage instructions for the BioEnEx system. The tool can annotate multiple biomedical semantic types with high performance. For example, the latest version of the system obtains an F1 score of 86.22% for Gene/Protein mention recognition on the BioCreative II GM Corpus without using any dictionary or lexicon.

The version of the system available on this page is configured only to extract "Diseases" with high precision and recall. This version achieves an F1 score of 81.08% on the Arizona Disease Corpus.

Please write to Faisal Chowdhury (fmchowdhury@gmail.com) if you have any questions regarding the system or if you want the latest version.

If you use this system (or any part of its code) in research or for other purposes, please acknowledge it by citing the following paper:

         # Md. Faisal Mahbub Chowdhury, Alberto Lavelli, “Disease Mention Recognition with Specific Features”, In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (BioNLP), ACL 2010, pages 91-98, Uppsala, Sweden, July, 2010.

If you use the system (or any part of it) for Gene/Protein mention recognition, then please acknowledge it by citing the following paper:

         # Md. Faisal Mahbub Chowdhury and Alberto Lavelli, "Assessing the practical usability of an automatically annotated corpus", In Proceedings of the 5th Linguistic Annotation Workshop (LAW V), ACL-HLT 2011, pages 101–109, Portland, Oregon, USA, June 23-24, 2011.

We used an earlier version of the tool in the CALBC Challenge I. Our system achieved the highest overall F-measure (above 86%) among all the participants of the NER task for multiple semantic groups (i.e. genes/proteins, diseases, species and chemicals). See the following extended abstract for details:

         # Md. Faisal Mahbub Chowdhury, Alberto Lavelli, “Robust Biomedical Entity Recognition Using Optimal Feature Set”, In Proceedings of the 1st CALBC Workshop, pages 29-30, EMBL-EBI, Hinxton, Cambridge, U.K., 17-18 June, 2010.

Dependencies:

The system relies on other tools for some of the data processing steps. Please follow the instructions below to install the dependencies:

1. Install the LVG Light 2012 tool from http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/web/download.html and follow the instructions on that page
- Set the value of "lvgInstallDir" in the BioTeMi.config file to the directory path of lvg2012light
  (Please make sure the directory path contains no space characters)
- Build a jar named lvg2012api.jar from the files in the lvg2012light/bin directory
- Add the following jars to your Java classpath
-- lvg2012api.jar
-- lvg2012dist.jar (available in the lvg2012light/dist directory)

2. Install Mallet
- Download and install Mallet from http://mallet.cs.umass.edu/
- Add the following jars, which should be available under the "dist" folder of your Mallet installation directory, to your Java classpath
-- mallet.jar
-- mallet-deps.jar

3. Download and extract version 2012-03-09 of the Stanford parser
- Add the following jars, which should be available under the installation directory of the Stanford parser, to your Java classpath
-- stanford-parser.jar
-- stanford-parser-2012-03-09-models.jar
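
Before running BioEnEx, it may help to confirm that the jars listed above are actually visible to the JVM. The following is a minimal sketch and not part of BioEnEx: the class ClasspathCheck is hypothetical, and the three probe class names are assumptions about what each jar contains, so adjust them if your versions differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of BioEnEx) that probes the classpath. */
final class ClasspathCheck {

    /** Returns true if the named class can be located on the classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Probe class names are assumptions about the contents of each jar.
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("LVG (lvg2012dist.jar)", "gov.nih.nlm.nls.lvg.Api.LvgCmdApi");
        probes.put("Mallet (mallet.jar)", "cc.mallet.fst.CRF");
        probes.put("Stanford parser (stanford-parser.jar)",
                "edu.stanford.nlp.parser.lexparser.LexicalizedParser");
        for (Map.Entry<String, String> probe : probes.entrySet()) {
            String status = isOnClasspath(probe.getValue()) ? "found" : "MISSING";
            System.out.println(probe.getKey() + ": " + status);
        }
    }
}
```

Run it with the same classpath you intend to use for BioEnEx; any line reporting MISSING points to a jar that still needs to be added.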


Input Data Format:

BioEnEx expects data (in both training and test data files) in the following format: each input line contains a Sentence_ID followed by the sentence. A Sentence_ID cannot contain any space character. There must be no blank lines between consecutive input lines. Here is an example -

7 Therefore, we screened eight familial gastric cancer kindreds of British and Irish origin for germline CDH1 mutations, by SSCP analysis of all 16 exons and flanking sequences.
8 We have confirmed that germline mutations in the CDH1 gene cause familial gastric cancer in non-Maori populations.
9 However, only a minority of familial gastric cancers can be accounted for by CDH1 mutations.
.............
.................
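
The input format described above can be read with a few lines of Java. This is only a sketch under stated assumptions: the class SentenceFileReader is hypothetical (it is not part of BioEnEx), and it assumes that the Sentence_ID and the sentence are separated by a single space, as in the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader (not part of BioEnEx) for "Sentence_ID Sentence" lines. */
final class SentenceFileReader {

    /** Splits each line at its first space into (id, sentence), keeping input order. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> sentences = new LinkedHashMap<>();
        for (String line : lines) {
            int firstSpace = line.indexOf(' ');
            if (firstSpace <= 0 || firstSpace == line.length() - 1) {
                // Blank lines and lines without "ID<space>text" violate the format.
                throw new IllegalArgumentException("Malformed input line: '" + line + "'");
            }
            sentences.put(line.substring(0, firstSpace), line.substring(firstSpace + 1));
        }
        return sentences;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = parse(List.of(
                "8 We have confirmed that germline mutations in the CDH1 gene cause "
                        + "familial gastric cancer in non-Maori populations.",
                "9 However, only a minority of familial gastric cancers can be "
                        + "accounted for by CDH1 mutations."));
        parsed.forEach((id, text) -> System.out.println(id + " -> " + text));
    }
}
```

Rejecting malformed lines early makes it easier to spot stray blank lines before a long training run starts.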


Input Annotation Format:

BioEnEx expects entity annotations in the following format: Sentence_ID|X Y|ENTITY_NAME in each input line. Here, X and Y correspond to the starting and ending character indices of ENTITY_NAME inside the sentence with id Sentence_ID, counted over non-space characters only. In the example, the indices are 0-based and Y is inclusive. There must be no blank lines between consecutive input lines. Here is an example -

7|25 45|familial gastric cancer
8|54 74|familial gastric cancer
9|23 44|familial gastric cancers
.............
.................
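
The offsets in the example above are consistent with 0-based, inclusive indices counted over the non-space characters of the sentence alone (the Sentence_ID is not counted). The sketch below reproduces them under that reading. The class NonSpaceOffsets is hypothetical and not part of BioEnEx, and it treats only the plain space character as whitespace, which is an assumption.

```java
/** Hypothetical helper (not part of BioEnEx) that builds annotation lines. */
final class NonSpaceOffsets {

    /** Counts the non-space characters in text[0, end). */
    static int nonSpaceBefore(String text, int end) {
        int count = 0;
        for (int i = 0; i < end; i++) {
            if (text.charAt(i) != ' ') {
                count++;
            }
        }
        return count;
    }

    /** Formats "Sentence_ID|X Y|ENTITY_NAME" for the first occurrence of mention. */
    static String annotate(String sentenceId, String sentence, String mention) {
        int start = sentence.indexOf(mention);
        if (start < 0) {
            throw new IllegalArgumentException("Mention not found: " + mention);
        }
        int x = nonSpaceBefore(sentence, start);
        // Y is inclusive: index of the last non-space character of the mention.
        int y = nonSpaceBefore(sentence, start + mention.length()) - 1;
        return sentenceId + "|" + x + " " + y + "|" + mention;
    }

    public static void main(String[] args) {
        String sentence = "However, only a minority of familial gastric cancers "
                + "can be accounted for by CDH1 mutations.";
        // prints 9|23 44|familial gastric cancers
        System.out.println(annotate("9", sentence, "familial gastric cancers"));
    }
}
```

Counting only non-space characters keeps the offsets stable even if a tokenizer later inserts or removes spaces.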


Output Format:

The output annotations for the input test data are provided in the same format as the input annotations, i.e. Sentence_ID|X Y|ENTITY_NAME in each output line.
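
To use the output in other tools, the non-space offsets usually have to be mapped back to ordinary character positions. The following sketch shows one way to do this. The class AnnotationResolver is hypothetical (not part of BioEnEx), and it assumes 0-based, inclusive offsets over non-space characters, as suggested by the annotation example above.

```java
/** Hypothetical helper (not part of BioEnEx) for reading output annotations. */
final class AnnotationResolver {

    /**
     * Maps the inclusive non-space span [x, y] back to ordinary character
     * positions and returns the covered substring of the sentence.
     */
    static String resolve(String sentence, int x, int y) {
        int seen = -1;   // index of the most recent non-space character
        int begin = -1;  // ordinary character position where the span starts
        for (int i = 0; i < sentence.length(); i++) {
            if (sentence.charAt(i) == ' ') {
                continue;
            }
            seen++;
            if (seen == x) {
                begin = i;
            }
            if (seen == y && begin >= 0) {
                return sentence.substring(begin, i + 1);
            }
        }
        throw new IllegalArgumentException("Span " + x + " " + y + " is outside the sentence");
    }

    /** Parses "Sentence_ID|X Y|ENTITY_NAME" and checks it against the sentence. */
    static boolean matches(String annotationLine, String sentence) {
        String[] fields = annotationLine.split("\\|", 3);
        String[] span = fields[1].trim().split("\\s+");
        String covered = resolve(sentence,
                Integer.parseInt(span[0]), Integer.parseInt(span[1]));
        return covered.equals(fields[2]);
    }

    public static void main(String[] args) {
        String sentence = "We have confirmed that germline mutations in the CDH1 gene "
                + "cause familial gastric cancer in non-Maori populations.";
        System.out.println(resolve(sentence, 54, 74));
        System.out.println(matches("8|54 74|familial gastric cancer", sentence));
    }
}
```

The matches check is a cheap sanity test: if it fails for an output line, the sentence file and the annotation file are probably out of sync.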


How to run the system:

You can use directly the jar org.fbk.it.hlt.bioEntityExtractor inside your java code or you can run it from shell. The program accepts the following parameters -

--size-train    Percentage of the data to be used for training if there is no separate test data. Default: 90
--n             Number of folds to be used for n-fold cross-validation. Default: 1, i.e. no cross-validation.
--biewo         Whether to annotate using the BIEWO labelling format. Default: false (i.e. IOB2)
--train-id-sen  The filename for reading training data.
--test-id-sen   The filename for reading test data.
--model         The model file name. If it is given, training is skipped. Default: null
--entity        The filename for the bio-entity annotations of the test data.
--pre-process   Whether to pre-process data. Default: true
--train-entity  The filename for the bio-entity annotations of the training data.
--db            The filename for reading the entity database.
--fold-path     The pathname for the sentence ID folds.
--eval          Whether evaluation of the annotation should be done. Default: true.
--sp            Whether syntactic parsing should be skipped. Default: false.
--th            Number of threads. Default: 1.
--trtpf         File name of the parsed training data (if already available)
--tetpf         File name of the parsed test data (if already available)
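
The --biewo option switches between two token labelling schemes. This page does not define the labels, so the sketch below rests on an assumption: that BIEWO uses B (begin), I (inside), E (end), W (single-token mention) and O (outside), while IOB2 uses only B, I and O. The class LabelSchemes and the toy tokens are purely illustrative and are not BioEnEx code.

```java
import java.util.Arrays;

/** Hypothetical illustration (not BioEnEx code) of the two labelling schemes. */
final class LabelSchemes {

    /**
     * Labels tokens given mention spans as {firstToken, lastToken} pairs (inclusive).
     * With biewo=false the IOB2 scheme is used; with biewo=true the assumed BIEWO scheme.
     */
    static String[] label(int tokenCount, int[][] mentions, boolean biewo) {
        String[] labels = new String[tokenCount];
        Arrays.fill(labels, "O");
        for (int[] mention : mentions) {
            int first = mention[0];
            int last = mention[1];
            for (int i = first; i <= last; i++) {
                labels[i] = (i == first) ? "B" : "I";
            }
            if (biewo) {
                if (first == last) {
                    labels[first] = "W";  // assumed: single-token mention
                } else {
                    labels[last] = "E";   // assumed: last token of a longer mention
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy tokens with two mentions: tokens 2..4 and token 6.
        String[] tokens = {"mutations", "cause", "familial", "gastric", "cancer", "and", "ataxia"};
        int[][] mentions = {{2, 4}, {6, 6}};
        System.out.println(String.join(" ", tokens));
        System.out.println("IOB2 : " + String.join(" ", label(tokens.length, mentions, false)));
        System.out.println("BIEWO: " + String.join(" ", label(tokens.length, mentions, true)));
    }
}
```

Under these assumptions the IOB2 labels are O O B I I O B and the BIEWO labels are O O B I E O W.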

Here is a sample code snippet to run the system -

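import org.fbk.it.hlt.bioEntityExtractor.*;

public class Test {

    /**
     * Runs 10-fold cross-validation on the disease annotations.
     *
     * @param args not used
     * @throws Exception if the run fails
     */
    public static void main(String[] args) throws Exception {

        String[] arguments = new String[] {
            // Training sentences, one "Sentence_ID Sentence" per line
            "--train-id-sen", "/media/Study/workspace/data/azdc/sentencesWithId.azdc",
            // Number of cross-validation folds
            "--n", "10",
            // Entity annotations of the training sentences
            "--train-entity", "/media/Study/workspace/data/azdc/disease.annotations.azdc",
            // Location of the sentence ID folds
            "--fold-path", "/media/Study/workspace_2/exp_banner/folds_by_abstract/sen"
        };
        new BioEntityLocator().run(arguments);
    }
}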
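
The sample above runs cross-validation. For a run with separate training and test files, the arguments might be assembled as in the following sketch. The class TrainTestArguments and all file paths are hypothetical, and the way the flags are combined is inferred from the parameter descriptions rather than confirmed by the BioEnEx code. The call to BioEntityLocator is left as a comment so that the sketch compiles without the BioEnEx jar.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles BioEnEx arguments for a train/test run. */
final class TrainTestArguments {

    /** Builds the argument array from the four input files and a thread count. */
    static String[] build(String trainSentences, String trainEntities,
                          String testSentences, String testEntities, int threads) {
        List<String> args = new ArrayList<>();
        args.add("--train-id-sen");
        args.add(trainSentences);
        args.add("--train-entity");
        args.add(trainEntities);
        args.add("--test-id-sen");
        args.add(testSentences);
        args.add("--entity");       // annotations of the test data, used for evaluation
        args.add(testEntities);
        args.add("--th");
        args.add(Integer.toString(threads));
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // All paths below are placeholders, not files shipped with BioEnEx.
        String[] arguments = build(
                "data/train.sentences", "data/train.annotations",
                "data/test.sentences", "data/test.annotations", 4);
        System.out.println(String.join(" ", arguments));
        // With the BioEnEx jar on the classpath, the array would be passed on
        // as in the snippet above: new BioEntityLocator().run(arguments);
    }
}
```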


Disease Dictionary:

A pre-processed disease entity dictionary (collected from the UMLS Metathesaurus) is included in the /db folder of the zipped folder. Please refer to the BioNLP 2010 paper (mentioned earlier) for details on how the dictionary was created.

Syntactic Parser:

Note: We use the Stanford parser for syntactic parsing, which takes some time if the data have not been parsed already.

The original system described in the BioNLP 2010 paper uses GeniaTagger for tokenization and POS tagging before parsing with the Stanford parser. For simplicity, we dropped GeniaTagger and now use the Stanford parser for tokenization and POS tagging as well. This might result in slightly lower performance.

Mailing List:

We encourage you to subscribe to the following mailing list if you want to receive information about future releases and updates -

https://lists.sourceforge.net/lists/listinfo/bioenex-bioenex

Downloads:

- 177k, v. 1, Apr 4, 2012, 1:31 AM, Faisal Chowdhury
- 772k, v. 1, Jun 15, 2012, 3:43 PM, Faisal Chowdhury
- 125k, v. 1, Jun 11, 2012, 3:39 AM, Faisal Chowdhury