Department of Computer Science, Università degli Studi di Milano, Via Comelico 39, Milano, 20135, Italy.
BMC Bioinformatics. 2018 Nov 20;19(Suppl 14):417. doi: 10.1186/s12859-018-2385-x.
Supervised machine learning methods, when applied to automated protein function prediction (AFP), require both positive examples (i.e., proteins known to possess a given function) and negative examples (proteins not associated with that function). Unfortunately, publicly available proteome and genome data sources such as the Gene Ontology rarely record the functions that a protein does not possess. Negative selection, which consists of identifying informative negative examples, is therefore a central and challenging problem in AFP. Several heuristics have been proposed over the years to solve this problem; nevertheless, despite their effectiveness, to the best of our knowledge no previous work has studied which protein features are most relevant to this task, that is, which protein features best discriminate reliable from unreliable negatives.
The present work analyses the impact of several features on the selection of negative proteins for Gene Ontology (GO) terms. The analysis is network-based: it exploits the fact that proteins can be naturally structured in a network built from the pairwise relationships found in several data sources, such as protein-protein and genetic interactions. The proposed protein features, which include local and global graph centrality measures and protein multifunctionality, are either term-aware (i.e., dependent on the GO term) or term-unaware (i.e., invariant across GO terms). We validated the informativeness of each feature using a temporal holdout in three different experiments on the yeast, mouse and human proteomes: (i) feature selection, to detect which protein features are most helpful for negative selection; (ii) protein function prediction, to verify whether the features considered are also useful for predicting GO terms; (iii) negative selection, by applying two different negative selection algorithms to proteins represented through the proposed features.
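The distinction between term-aware and term-unaware features can be illustrated with a minimal sketch. The abstract does not give the exact feature definitions, so everything below is an assumption made for illustration: the function names, the toy edge weights, the placeholder term identifiers `GO:A` and `GO:B`, and the choice of weighted degree and annotation count as stand-ins for a local centrality measure and for multifunctionality.

```python
from collections import defaultdict

def build_network(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj

def weighted_degree(adj, protein):
    """Term-unaware local feature: sum of the weights of edges incident to the protein."""
    return sum(adj[protein].values())

def positive_neighborhood(adj, protein, positives):
    """Term-aware local feature: total edge weight toward proteins annotated with the term."""
    return sum(w for nbr, w in adj[protein].items() if nbr in positives)

def multifunctionality(annotations, protein):
    """Term-unaware feature (assumed definition): number of terms annotated to the protein."""
    return sum(1 for proteins in annotations.values() if protein in proteins)

# Toy network and annotations; all identifiers and weights are hypothetical.
edges = [("P1", "P2", 1.0), ("P1", "P3", 0.5), ("P2", "P3", 0.25), ("P3", "P4", 1.0)]
annotations = {"GO:A": {"P1", "P2"}, "GO:B": {"P2"}}
adj = build_network(edges)

for protein in ["P1", "P2", "P3", "P4"]:
    print(
        protein,
        weighted_degree(adj, protein),
        positive_neighborhood(adj, protein, annotations["GO:A"]),
        multifunctionality(annotations, protein),
    )
```

In this sketch only `positive_neighborhood` changes when a different term is passed in, which is what makes it term-aware; the other two values are fixed per protein. Global measures such as betweenness would need a shortest-path computation over the whole network and are omitted here.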
Term-aware features (with some exceptions) proved more informative for problem (i), together with node betweenness, which is the most relevant among the term-unaware features. The node positive neighborhood is instead the most predictive feature for the AFP problem, while experiment (iii) showed that the proposed features allow negative selection algorithms to effectively select negative instances in the temporal holdout setting, with better results when nonlinear combinations of features are also exploited.
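One way to read the temporal holdout setting is sketched below: a protein that lacks a term in an older annotation release but gains it in a later release is treated as an unreliable negative, and a negative selection is scored by how many such proteins it picks. This is an assumed formulation for illustration only; the function names, the toy protein identifiers and the error measure are hypothetical and are not taken from the paper.

```python
def unreliable_negatives(old_positives, new_positives, all_proteins):
    """Proteins not annotated in the older release that gained the term in the later one."""
    return {p for p in all_proteins if p not in old_positives and p in new_positives}

def selection_error(selected_negatives, old_positives, new_positives):
    """Fraction of selected negatives that were annotated with the term in the later release."""
    selected = set(selected_negatives) - set(old_positives)
    if not selected:
        return 0.0
    return len(selected & set(new_positives)) / len(selected)

# Toy releases for a single term; all identifiers are hypothetical.
all_proteins = {"P1", "P2", "P3", "P4", "P5"}
old_positives = {"P1"}
new_positives = {"P1", "P2"}

print(unreliable_negatives(old_positives, new_positives, all_proteins))
print(selection_error(["P3", "P4"], old_positives, new_positives))
print(selection_error(["P2", "P3"], old_positives, new_positives))
```

A lower error means the selection avoided proteins that later turned out to possess the function.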