Multi-label literature classification based on the Gene Ontology graph.

Authors

Jin Bo, Muller Brian, Zhai Chengxiang, Lu Xinghua

Affiliation

Department of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, Charleston, SC 29425, USA.

Publication information

BMC Bioinformatics. 2008 Dec 8;9:525. doi: 10.1186/1471-2105-9-525.

Abstract

BACKGROUND

The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification.

RESULTS

In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community.
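To make the contrast between flat and top-down hierarchical classification concrete, here is a minimal sketch. It is not the authors' implementation: the toy graph, the term names, the keyword sets, and the functions `local_score`, `top_down_predict`, and `flat_predict` are all hypothetical placeholders, and a simple keyword-overlap score stands in for trained per-term classifiers. The only idea taken from the abstract is that a top-down method visits a term only after its parent in the graph has been accepted, whereas a flat method scores every term independently.

```python
from collections import deque

# Hypothetical toy fragment of an ontology graph: parent -> children.
# Term names are illustrative placeholders, not real GO identifiers.
CHILDREN = {
    "root": ["binding", "catalytic_activity"],
    "binding": ["dna_binding", "protein_binding"],
    "catalytic_activity": ["kinase_activity"],
    "dna_binding": [],
    "protein_binding": [],
    "kinase_activity": ["protein_kinase_activity"],
    "protein_kinase_activity": [],
}

# Assumption: keyword sets stand in for trained per-term binary classifiers.
KEYWORDS = {
    "binding": {"binds", "binding"},
    "catalytic_activity": {"catalyzes", "kinase", "phosphorylates"},
    "dna_binding": {"dna", "promoter"},
    "protein_binding": {"interacts", "complex"},
    "kinase_activity": {"kinase", "phosphorylates"},
    "protein_kinase_activity": {"phosphorylates"},
}


def local_score(term, tokens):
    """Return the fraction of the term's keywords that occur in the text."""
    keywords = KEYWORDS[term]
    return len(keywords & tokens) / len(keywords)


def flat_predict(text, threshold=0.5):
    """Flat baseline: score every term independently of the graph."""
    tokens = set(text.lower().split())
    return [t for t in KEYWORDS if local_score(t, tokens) >= threshold]


def top_down_predict(text, threshold=0.5):
    """Top-down: descend from the root, expanding only accepted terms."""
    tokens = set(text.lower().split())
    predicted = []
    seen = {"root"}          # each term is scored at most once
    queue = deque(["root"])  # breadth-first traversal of accepted terms
    while queue:
        node = queue.popleft()
        for child in CHILDREN[node]:
            if child in seen:
                continue
            seen.add(child)
            if local_score(child, tokens) >= threshold:
                predicted.append(child)
                queue.append(child)  # only accepted terms are expanded
    return predicted


if __name__ == "__main__":
    print(top_down_predict("the protein binds the promoter dna"))
    print(flat_predict("phosphorylates the promoter"))
    print(top_down_predict("phosphorylates the promoter"))
```

In this toy setting the second text triggers several specific terms under the flat baseline, while the top-down variant returns nothing because no top-level term is accepted. This pruning keeps predictions consistent with the graph, at the cost of propagating errors made near the root.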

CONCLUSION

Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.
