Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature.

Author information

Raychaudhuri Soumya, Chang Jeffrey T, Sutphin Patrick D, Altman Russ B

Affiliation

Department of Genetics, Stanford University, Stanford, California 94305, USA.

Publication information

Genome Res. 2002 Jan;12(1):203-14. doi: 10.1101/gr.199701.

Abstract

Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We applied three document classification methods (maximum entropy modeling, naïve Bayes classification, and nearest-neighbor classification) to the problem of associating a set of GO codes (for biological process) with literature abstracts, and thus with the genes associated with those abstracts, and compared their performance. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72% when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.
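
As a rough illustration of the maximum entropy approach named in the abstract, the sketch below trains a multinomial logistic (maximum entropy) classifier over binary bag-of-words features using plain batch gradient ascent, and reports the predicted label together with its probability as a confidence measure. The toy abstracts, the three labels, the function names, and the hyperparameters are all made up for illustration. The abstract does not describe the authors' features or training procedure, so this should not be read as their implementation.

```python
import math
from collections import Counter

# Toy, invented training "abstracts" with hypothetical GO-style labels.
TRAIN = [
    ("kinase phosphorylation signal cascade receptor", "signal transduction"),
    ("receptor ligand signal pathway activation", "signal transduction"),
    ("ribosome translation trna elongation peptide", "protein biosynthesis"),
    ("translation initiation ribosome mrna peptide", "protein biosynthesis"),
    ("glucose glycolysis enzyme metabolism pyruvate", "metabolism"),
    ("enzyme metabolism lipid oxidation pathway", "metabolism"),
]


def features(text):
    """Binary bag-of-words features: the set of words present in the text."""
    return set(text.lower().split())


def predict_proba(weights, labels, words):
    """Softmax over summed (label, word) weights, the maxent model form."""
    scores = [sum(weights.get((lab, w), 0.0) for w in words) for lab in labels]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {lab: e / total for lab, e in zip(labels, exps)}


def train(data, epochs=300, lr=0.1, l2=0.01):
    """Batch gradient ascent on the L2-regularized log-likelihood."""
    labels = sorted({lab for _, lab in data})
    weights = {}
    for _ in range(epochs):
        grad = Counter()
        for text, gold in data:
            words = features(text)
            probs = predict_proba(weights, labels, words)
            for lab in labels:
                # Observed minus expected count for each (label, word) feature.
                delta = (1.0 if lab == gold else 0.0) - probs[lab]
                for w in words:
                    grad[(lab, w)] += delta
        for key, g in grad.items():
            old = weights.get(key, 0.0)
            weights[key] = old + lr * (g - l2 * old)
    return weights, labels


def classify(weights, labels, text):
    """Return the most probable label and its probability (the confidence)."""
    probs = predict_proba(weights, labels, features(text))
    best = max(probs, key=probs.get)
    return best, probs[best]


weights, labels = train(TRAIN)
label, confidence = classify(weights, labels, "ribosome peptide translation")
print(label, round(confidence, 3))
```

The probability returned by `classify` plays the role of the confidence measure mentioned in the abstract: in a setting like this one, low-probability predictions could be routed to a human curator rather than accepted automatically.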
