Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia.
BMC Bioinformatics. 2013 Sep 26;14:285. doi: 10.1186/1471-2105-14-285.
Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, in which instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in the predictive accuracy of the learned classifiers.
This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets, using each of the MIPS-FUN and GO annotation schemes and exploiting 2 different PPI networks. The results clearly show that taking autocorrelation into account improves the predictive performance of the learned gene function prediction models.
Our newly developed method for HMC takes network information into account in the learning phase. When used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/descriptions, functional annotation schemes, and PPI networks. The best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions.
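The two ideas the abstract relies on, hierarchical annotations and network autocorrelation, can be illustrated with a small sketch. It is not the NHMC algorithm or the autocorrelation measure used in the article. It only assumes a toy class hierarchy (`PARENT`), four hypothetical genes (`g1` to `g4`), and a made-up edge list, and it uses Jaccard similarity as a stand-in measure to compare how similar the annotations of linked genes are relative to all gene pairs.

```python
from itertools import combinations

# Hypothetical toy hierarchy (child class -> parent class); the codes only
# imitate the look of MIPS-FUN identifiers and are not real annotations.
PARENT = {"01.01": "01", "01.02": "01", "02.01": "02"}


def with_ancestors(labels):
    """Close a label set under the hierarchy: a gene annotated with a
    specific class also belongs to every more general class above it."""
    closed = set()
    for label in labels:
        while label is not None:
            closed.add(label)
            label = PARENT.get(label)  # None once the top level is reached
    return closed


def jaccard(a, b):
    """Jaccard similarity of two label sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_similarity(pairs, annotations):
    """Average annotation similarity over the given gene pairs."""
    pairs = list(pairs)
    total = sum(jaccard(annotations[u], annotations[v]) for u, v in pairs)
    return total / len(pairs)


# Hypothetical genes with their most specific annotations.
raw = {
    "g1": {"01.01"},
    "g2": {"01.02"},
    "g3": {"02.01"},
    "g4": {"02.01", "01.01"},
}
annotations = {gene: with_ancestors(labels) for gene, labels in raw.items()}

# Hypothetical PPI edges (undirected).
edges = [("g1", "g2"), ("g3", "g4")]

linked = mean_similarity(edges, annotations)
overall = mean_similarity(combinations(sorted(annotations), 2), annotations)

print(f"linked pairs: {linked:.3f}, all pairs: {overall:.3f}")
```

In this toy network, linked genes have more similar annotation sets than gene pairs in general, which is the kind of dependence between instances that breaks the i.i.d. assumption. A real analysis would use an actual PPI network and annotation scheme, and possibly a different autocorrelation statistic.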