Saric Jasmin, Jensen Lars J, Rojas Isabel
European Media Laboratory GmbH, D-69118 Heidelberg, Germany.
In Silico Biol. 2005;5(1):21-32.
This paper presents an approach that uses syntacto-semantic rules to extract relational information from biomedical abstracts. The results show that high-precision results can be achieved once the hurdle of technical terminology is overcome. From 58,664 abstracts related to baker's yeast, we extracted a regulatory network comprising 441 pairwise relations with an accuracy of 83-90%. To achieve this, we used a resource of gene/protein names considerably larger than those used in most other biology-related information extraction approaches. This list of names was included in the lexicon of our part-of-speech tagger, which was retrained for use on molecular biology abstracts. For the domain in question, an accuracy of 93.6-97.7% was attained on part-of-speech tags. The method can easily be adapted to organisms other than yeast, allowing us to extract many more biologically relevant relations. The main reason for the comparable precision rates is the ontological model that was built beforehand and served as a guiding force for the manual coding of the syntacto-semantic rules.
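As a rough illustration of the general idea of lexicon-backed, rule-based relation extraction, the following minimal Python sketch tags tokens by dictionary lookup and applies a single hand-written pattern. It is not the authors' system: the gene lexicon, the verb table, the filler words, and the function names `tag` and `extract_relations` are all hypothetical, and plain lexicon lookup stands in for the retrained part-of-speech tagger and the ontology-guided rules described in the abstract.

```python
import re

# Hypothetical mini-lexicon of yeast gene names; the resource described
# in the abstract is considerably larger.
GENE_LEXICON = {"GAL4", "GAL1", "MIG1", "SUC2"}

# Hypothetical mapping from verb forms to relation types.
RELATION_VERBS = {
    "activates": "activation",
    "induces": "activation",
    "represses": "repression",
    "inhibits": "repression",
}

# Words this toy rule tolerates between the verb and its object.
FILLERS = {"the", "expression", "transcription", "of"}


def tag(sentence):
    """Assign a coarse semantic tag to each token by lexicon lookup."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    tagged = []
    for tok in tokens:
        if tok.upper() in GENE_LEXICON:
            tagged.append((tok.upper(), "GENE"))
        elif tok.lower() in RELATION_VERBS:
            tagged.append((tok.lower(), "REL"))
        elif tok.lower() in FILLERS:
            tagged.append((tok.lower(), "FILL"))
        else:
            tagged.append((tok, "OTHER"))
    return tagged


def extract_relations(sentence):
    """Apply one rule: GENE REL FILL* GENE -> (agent, relation, target)."""
    tagged = tag(sentence)
    relations = []
    for i in range(len(tagged) - 2):
        if tagged[i][1] != "GENE" or tagged[i + 1][1] != "REL":
            continue
        # Skip over filler words such as "the transcription of".
        j = i + 2
        while j < len(tagged) and tagged[j][1] == "FILL":
            j += 1
        if j < len(tagged) and tagged[j][1] == "GENE":
            relation = RELATION_VERBS[tagged[i + 1][0]]
            relations.append((tagged[i][0], relation, tagged[j][0]))
    return relations
```

Calling `extract_relations` on a made-up sentence such as "Gal4 activates the transcription of GAL1." yields one (agent, relation, target) triple, while a sentence without a listed verb yields none. Matching only a narrow pattern favors precision over recall, which is the trade-off the abstract reports.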