Bible Paul W, Sun Hong-Wei, Morasso Maria I, Loganantharaj Rasiah, Wei Lai
State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China.
Biodata Mining and Discovery Section, Office of Science and Technology, Intramural Research Program, National Institute of Arthritis and Musculoskeletal and Skin Diseases, Bethesda, Maryland.
Comput Struct Biotechnol J. 2017 Jan 30;15:195-211. doi: 10.1016/j.csbj.2017.01.009. eCollection 2017.
The Gene Ontology (GO), a structured vocabulary that describes gene function, serves as a powerful tool in biological research. One application of GO in computational biology calculates the semantic similarity between two concepts to make inferences about the functional similarity of genes. A class of term similarity algorithms explicitly calculates the shared information (SI) between concepts and then substitutes this calculation into traditional term similarity measures such as Resnik, Lin, and Jiang-Conrath. Alternative SI approaches, when combined with ontology choice and term similarity type, lead to many gene-to-gene similarity measures. No thorough investigation has been made into the behavior, complexity, and performance of semantic methods derived from distinct SI approaches. We apply bootstrapping to compare the generalized performance of 57 gene-to-gene semantic measures across six benchmarks. Given the number of measures, we additionally evaluate whether these methods can be leveraged through ensemble machine learning to improve prediction performance. The results show that the choice of ontology type most strongly influenced performance across all evaluations. Combining measures into an ensemble classifier reduced cross-validation error below that of any individual measure for protein interaction prediction. This improvement resulted from information gained by combining ontology types, as ensemble methods within each GO type offered no improvement. These results demonstrate that multiple SI measures can be leveraged for machine learning tasks such as automated gene function prediction by incorporating methods from across the ontologies. To facilitate future research in this area, we developed the GO Graph Tool Kit (GGTK), an open source C++ library with a Python interface (github.com/paulbible/ggtk).
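To make the role of shared information concrete, the following is a minimal Python sketch of how a single SI value can be substituted into the Resnik, Lin, and Jiang-Conrath term measures. It does not use GGTK and is not the authors' implementation. The toy ontology (`PARENTS`), the annotation probabilities (`PROB`), and all function names are hypothetical, and SI is assumed here to be the information content of the most informative common ancestor, which is only one of several possible SI approaches.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},
}

# Hypothetical annotation probabilities p(t): fraction of genes
# annotated to term t or to any of its descendants.
PROB = {"root": 1.0, "A": 0.5, "B": 0.4, "C": 0.2, "D": 0.1}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content of a term: IC(t) = -ln p(t)."""
    return -math.log(PROB[term])


def shared_information(a, b):
    """Assumed SI: IC of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(ic(t) for t in common)


def resnik(a, b):
    """Resnik similarity is the shared information itself."""
    return shared_information(a, b)


def lin(a, b):
    """Lin similarity: 2 * SI / (IC(a) + IC(b))."""
    denom = ic(a) + ic(b)
    # Both terms carry no information (e.g. root vs. root): treat as identical.
    return 2.0 * shared_information(a, b) / denom if denom > 0 else 1.0


def jiang_conrath_distance(a, b):
    """Jiang-Conrath distance: IC(a) + IC(b) - 2 * SI."""
    return ic(a) + ic(b) - 2.0 * shared_information(a, b)


if __name__ == "__main__":
    for pair in [("C", "D"), ("C", "B"), ("D", "D")]:
        print(pair, resnik(*pair), lin(*pair), jiang_conrath_distance(*pair))
```

Because all three measures take the SI value as an input, swapping `shared_information` for a different SI approach changes every measure at once. This is how distinct SI approaches, ontology choices, and term similarity types multiply into many measures.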