Saric Jasmin, Jensen Lars Juhl, Ouzounova Rossitza, Rojas Isabel, Bork Peer
EML Research gGmbH, D-69118 Heidelberg, Germany.
Bioinformatics. 2006 Mar 15;22(6):645-50. doi: 10.1093/bioinformatics/bti597. Epub 2005 Jul 26.
We have previously developed a rule-based approach for extracting information on the regulation of gene expression in yeast. The biomedical literature, however, contains information on several other equally important regulatory mechanisms, in particular phosphorylation, which we have now extended our rule-based system to extract as well.
This paper presents new results for the extraction of relational information from biomedical text. We have improved our system, STRING-IE, to capture both new types of linguistic constructs and new types of biological information [i.e. (de-)phosphorylation]. Precision remains stable, with a slight increase in recall. From almost one million PubMed abstracts related to four model organisms, we extract regulatory networks and binary phosphorylations comprising 3,319 relation chunks. The accuracy is 83-90% for gene expression relations and 86-95% for (de-)phosphorylation relations. To achieve this, we made use of an organism-specific resource of gene/protein names considerably larger than those used in most other biology-related information extraction approaches. These names were included in the lexicon when retraining the part-of-speech (POS) tagger on the GENIA corpus. For the domain in question, the tagger attained an accuracy of 96.4% on POS tags. Although the rules were developed for yeast, they were successfully applied to both abstracts and full-text articles related to other organisms, with comparable accuracy.
The revised GENIA corpus, the POS tagger, the extraction rules and the full sets of extracted relations are available from http://www.bork.embl.de/Docu/STRING-IE
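The general idea of lexicon-driven, rule-based relation extraction described above can be illustrated with a toy sketch. This is not the STRING-IE rule set, which relies on a retrained POS tagger and far richer linguistic rules; the lexicon `GENE_LEXICON`, the trigger table `TRIGGERS`, the relation labels and the function names below are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-lexicon of gene/protein names (assumption for illustration;
# the organism-specific resource described in the abstract is far larger).
GENE_LEXICON = {"Pho85", "Pho4", "Gal4", "GAL1"}

# Hypothetical trigger verbs mapped to illustrative relation labels.
TRIGGERS = {
    "phosphorylates": "phosphorylation",
    "dephosphorylates": "dephosphorylation",
    "activates": "expression_activation",
    "represses": "expression_repression",
}


def tag(sentence):
    """Label each token as GENE, TRIGGER or O using simple lexicon lookups."""
    tagged = []
    for tok in re.findall(r"[A-Za-z0-9]+", sentence):
        if tok in GENE_LEXICON:
            tagged.append((tok, "GENE"))
        elif tok.lower() in TRIGGERS:
            tagged.append((tok.lower(), "TRIGGER"))
        else:
            tagged.append((tok, "O"))
    return tagged


def extract_relations(sentence):
    """Apply one rule: nearest GENE left of a TRIGGER acts on nearest GENE to its right."""
    tagged = tag(sentence)
    relations = []
    for i, (tok, label) in enumerate(tagged):
        if label != "TRIGGER":
            continue
        # Nearest gene name before the trigger (candidate agent).
        agent = next((t for t, l in reversed(tagged[:i]) if l == "GENE"), None)
        # Nearest gene name after the trigger (candidate target).
        target = next((t for t, l in tagged[i + 1:] if l == "GENE"), None)
        if agent and target:
            relations.append((agent, TRIGGERS[tok], target))
    return relations


if __name__ == "__main__":
    # Illustrative sentences, not taken from the paper's corpus.
    print(extract_relations("Pho85 phosphorylates Pho4 in vitro."))
    print(extract_relations("Gal4 activates GAL1 expression."))
```

A single surface pattern like this would miss passives, nominalizations and coordination, which is presumably why a system of the kind described combines POS tagging with a larger set of rules.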