DISI, Università degli Studi di Bologna, Via Venezia 52, 47521 Cesena, Italy.
DEIB, Politecnico di Milano, Piazza L. Da Vinci 32, 20133 Milan, Italy.
Comput Methods Programs Biomed. 2016 Apr;126:20-34. doi: 10.1016/j.cmpb.2015.12.002. Epub 2015 Dec 17.
Knowledge of gene and protein functions is essential for understanding physiological and pathological biological processes, as well as for developing new drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the availability of gene and protein functional feature descriptions expressed through controlled terminologies and ontologies, i.e., gene and protein biomedical controlled annotations. In recent years, several databases of such annotations have become available; yet these valuable annotations are incomplete, include errors, and only some of them represent highly reliable human-curated information. Computational techniques that can reliably predict new gene or protein annotations, each with an associated likelihood value, are thus needed.
Here, we propose a novel cross-organism learning approach to reliably predict new functionalities for the genes of an organism based on the known controlled annotations of the genes of another, evolutionarily related and better-studied organism. We leverage a new representation of the annotation discovery problem and a random perturbation of the available controlled annotations, which together allow supervised algorithms to be applied to predict unknown gene annotations with good accuracy. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organism learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism.
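The general idea of training on perturbed annotations of one organism and predicting for another can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the representation, the perturbation scheme, or the supervised algorithm, so everything here is an assumption made for illustration. In particular, the sketch assumes that both organisms are described by binary gene-by-term matrices over the same ontology terms, that perturbation means randomly removing some known annotations, and that a small logistic regression per term stands in for the supervised algorithm. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Clamp z so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def perturb(matrix, drop_prob, rng):
    """Copy a binary gene x term matrix, removing each known annotation (1)
    with probability drop_prob. Annotations are never added."""
    return [[0 if v == 1 and rng.random() < drop_prob else v for v in row]
            for row in matrix]

def train_term_model(features, labels, epochs=200, lr=0.5):
    """Fit a tiny logistic regression by stochastic gradient descent.
    (Stand-in for whichever supervised algorithm is actually used.)"""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def predict(model, x):
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def cross_organism_predictions(source, target, drop_prob=0.3, seed=0):
    """Train one model per term on the source organism, then score every
    missing (gene, term) pair of the target organism.
    Returns (likelihood, gene_index, term_index) tuples, best first."""
    rng = random.Random(seed)
    perturbed = perturb(source, drop_prob, rng)
    ranked = []
    for t in range(len(source[0])):
        # Inputs: perturbed annotation profiles. Label: the original,
        # unperturbed annotation of each source gene to term t.
        model = train_term_model(perturbed, [row[t] for row in source])
        for g, row in enumerate(target):
            if row[t] == 0:  # only propose annotations not already known
                ranked.append((predict(model, row), g, t))
    ranked.sort(reverse=True)
    return ranked

# Toy data (hypothetical): 3 ontology terms; terms 0 and 1 co-occur in the source.
source = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 1, 0]]
target = [[1, 0, 0], [0, 0, 1]]

ranked = cross_organism_predictions(source, target)
for likelihood, gene, term in ranked:
    print(f"gene {gene}, term {term}: likelihood {likelihood:.3f}")
```

In this sketch the perturbation is what produces a supervised signal: each model sees annotation profiles with some entries removed and learns to recover the original annotation from the remaining ones, so a target gene lacking a term can still receive a high likelihood when its other annotations resemble those of source genes that carry it.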
We tested our method and compared it with the equivalent single-organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). Results show both that perturbing the available annotations is useful for training better prediction models and that the cross-organism models improve considerably on the single-organism ones, an improvement that was not influenced by the evolutionary distance between the organisms considered. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are valuable both to complement the available annotations, giving better coverage in biomedical knowledge discovery analyses, and to speed up the annotation curation process by focusing it on the prioritized novel annotations predicted.