Max-Delbrück Center for Molecular Medicine, Berlin, Germany.
BMC Bioinformatics. 2010 Feb 1;11:70. doi: 10.1186/1471-2105-11-70.
Biological knowledge is represented in scientific literature that often describes the function of genes/proteins (bioentities) in terms of their interactions (biointeractions). Such bioentities are often related to biological concepts of interest that are specific to a given research field. Therefore, studying the current literature on a selected topic deposited in public databases facilitates the generation of novel hypotheses associating a set of bioentities with a common context.
We created a text mining system (LAITOR: Literature Assistant for Identification of Terms co-Occurrences and Relationships) that analyses co-occurrences of bioentities, biointeractions, and other biological terms in MEDLINE abstracts. The method accounts for the position of the co-occurring terms within sentences or abstracts. The system detected abstracts mentioning protein-protein interactions in a standard test (BioCreative II IAS test data) with a precision of 0.82-0.89 and a recall of 0.48-0.70. We illustrate the application of LAITOR to the detection of plant response genes in a dataset of 1000 abstracts relevant to the topic.
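The sentence-level co-occurrence idea described above can be illustrated with a minimal sketch. It is not LAITOR's implementation: the two term dictionaries, the naive sentence splitter, the function names, and the example text are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from itertools import combinations

# Hypothetical toy dictionaries; the actual dictionaries used by LAITOR
# are not described in this abstract.
BIOENTITIES = {"JAZ1", "MYC2", "COI1"}
BIOINTERACTIONS = {"binds", "represses", "activates", "interacts"}


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]


def cooccurrences(abstract):
    """Report pairs of bioentities that share a sentence, together with
    the sentence index and any biointeraction terms in that sentence."""
    hits = []
    for idx, sentence in enumerate(split_sentences(abstract)):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        entities = sorted({t for t in tokens if t in BIOENTITIES})
        verbs = sorted({t.lower() for t in tokens if t.lower() in BIOINTERACTIONS})
        for a, b in combinations(entities, 2):
            hits.append({"sentence": idx, "pair": (a, b), "interactions": verbs})
    return hits


# Made-up example text, used only to exercise the functions.
example = "JAZ1 binds MYC2 in the nucleus. COI1 is a receptor. COI1 interacts with JAZ1."
for hit in cooccurrences(example):
    print(hit)
```

Recording the sentence index alongside each pair keeps the positional information available, so a later step could weight same-sentence co-occurrences differently from those that only share an abstract.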
Text mining tools that combine the extraction of interacting bioentities and biological concepts with network displays can help researchers develop reasonable hypotheses across different scientific fields.