Jin Bo, Muller Brian, Zhai Chengxiang, Lu Xinghua
Department of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, Charleston, SC 29425, USA.
BMC Bioinformatics. 2008 Dec 8;9:525. doi: 10.1186/1471-2105-9-525.
The Gene Ontology is a controlled vocabulary for representing knowledge about genes and proteins in a computable form. Manual annotation of proteins with Gene Ontology terms cannot keep pace with the accumulation of biomedical knowledge in the literature, which motivates text mining approaches that facilitate the process by automatically extracting Gene Ontology annotations from the literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification.
In this research, we investigated methods for enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We studied three graph-based multi-label classification algorithms for literature classification: a novel stochastic algorithm and two top-down hierarchical classification methods. We systematically evaluated these graph-based algorithms and compared them with a conventional flat multi-label algorithm. The results indicate that, by utilizing information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, even when the graph-based multi-label classifiers fail to predict the true annotations directly, they are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true ones. A software package implementing the studied algorithms is available to the research community.
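To make the contrast between flat and top-down hierarchical multi-label classification concrete, here is a minimal sketch in Python. It is not the algorithm evaluated in the paper, whose details the abstract does not give. It assumes a toy graph with made-up term names (not real Gene Ontology identifiers) and fixed per-term scores that stand in for the outputs of trained text classifiers. The function names `flat_predict` and `top_down_predict` and the 0.5 threshold are likewise hypothetical.

```python
from typing import Callable, Dict, List, Set

# Toy directed acyclic graph, parent -> children.
# Term names are hypothetical placeholders, not real GO terms.
CHILDREN: Dict[str, List[str]] = {
    "root": ["binding", "catalysis"],
    "binding": ["dna_binding", "protein_binding"],
    "catalysis": ["kinase"],
    "kinase": ["protein_kinase"],
    "dna_binding": [],
    "protein_binding": [],
    "protein_kinase": [],
}

# Assumed per-term classifier scores for one document (illustrative values only).
SCORES: Dict[str, float] = {
    "binding": 0.9,
    "dna_binding": 0.8,
    "protein_binding": 0.2,
    "catalysis": 0.1,
    "kinase": 0.95,
    "protein_kinase": 0.9,
}


def flat_predict(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Flat multi-label prediction: each term is decided independently."""
    return {term for term, s in scores.items() if s >= threshold}


def top_down_predict(
    root: str,
    children: Dict[str, List[str]],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> Set[str]:
    """Top-down prediction: a term is considered only if a parent was accepted."""
    predicted: Set[str] = set()
    visited: Set[str] = {root}
    stack: List[str] = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child in visited:  # a DAG node may have several parents
                continue
            visited.add(child)
            if score(child) >= threshold:
                predicted.add(child)
                stack.append(child)  # descend only below accepted terms
    return predicted


if __name__ == "__main__":
    print("flat:    ", sorted(flat_predict(SCORES)))
    print("top-down:", sorted(top_down_predict("root", CHILDREN, lambda t: SCORES[t])))
```

With these assumed scores, the flat classifier accepts `kinase` and `protein_kinase` even though their ancestor `catalysis` is rejected, so its output is inconsistent with the graph. The top-down traversal never reaches those terms, and every term it predicts has an accepted path back to the root. The trade-off is that an error near the root blocks all descendants, which is one reason a real system may need a more careful design than this sketch.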
By utilizing information from the structure of the Gene Ontology graph, graph-based multi-label classification methods have greater potential than the conventional flat multi-label classification approach to facilitate literature-based protein annotation.