Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
Bioinformatics. 2018 Jul 1;34(13):i52-i60. doi: 10.1093/bioinformatics/bty259.
Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations make them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications.
We propose the Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies. Our method can be applied to a wide range of bioinformatics research problems such as similarity-based prediction of interactions between proteins, classification of interaction types using supervised learning, or clustering. To evaluate Onto2Vec, we use the Gene Ontology (GO) and jointly produce dense vector representations of proteins, the GO classes to which they are annotated, and the axioms in GO that constrain these classes. First, we demonstrate that Onto2Vec-generated feature vectors can significantly improve prediction of protein-protein interactions in human and yeast. We then illustrate how Onto2Vec representations provide the means for constructing data-driven, trainable semantic similarity measures that can be used to identify particular relations between proteins. Finally, we use an unsupervised clustering approach to identify protein families based on their Enzyme Commission numbers. Our results demonstrate that Onto2Vec can generate high-quality feature vectors from biological entities and ontologies. Onto2Vec has the potential to significantly outperform the state of the art in several predictive applications in which ontologies are involved.
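To make the idea of similarity-based interaction prediction concrete, the following is a minimal sketch of how dense feature vectors could be compared and candidate protein pairs ranked. It assumes cosine similarity as the measure, which is one common choice and is not specified in the abstract. The protein names and three-dimensional vectors are made-up toy values, not Onto2Vec output, and the helper names `cosine_similarity` and `rank_pairs` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_pairs(vectors):
    """Score every unordered pair of entities and sort from most to least similar."""
    scored = [
        (cosine_similarity(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(scored, reverse=True)


# Toy feature vectors (assumption: real embeddings would be learned and much longer).
toy_vectors = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.8, 0.2, 0.1],
    "protein_C": [0.0, 0.1, 0.9],
}

ranking = rank_pairs(toy_vectors)
for score, a, b in ranking:
    print(f"{a} - {b}: {score:.3f}")
```

In this toy setting, pairs near the top of the ranking would be treated as the most likely interaction candidates. A trainable similarity measure, as mentioned above, would replace the fixed cosine function with a supervised model that takes the two vectors as input.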
Onto2Vec is available at https://github.com/bio-ontology-research-group/onto2vec.
Supplementary data are available at Bioinformatics online.