Department of Information Systems, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Jul-Aug;9(4):1059-69. doi: 10.1109/TCBB.2011.156.
Assigning biological functions to uncharacterized proteins is a fundamental problem in the postgenomic era. The increasing availability of large amounts of data on protein-protein interactions (PPIs) has led to a considerable number of computational methods for determining protein function in the context of a network. These algorithms, however, treat each functional class in isolation and therefore often suffer from the scarcity of labeled data. In reality, different functional classes are naturally dependent on one another. We propose a new algorithm, Multi-label Correlated Semi-supervised Learning (MCSL), which incorporates the intrinsic correlations among functional classes into protein function prediction by leveraging the relationships provided by the PPI network and the functional class network. The guiding intuition is that the classification function should be sufficiently smooth on subgraphs where the respective topologies of these two networks are a good match. We encode this intuition as regularized learning with intraclass and interclass consistency, which can be understood as an extension of the graph-based learning with local and global consistency (LGC) method. Cross-validation on the yeast proteome shows that MCSL consistently outperforms several state-of-the-art methods. Most notably, it effectively overcomes the problem associated with the scarcity of labeled data. The supplementary files are freely available at http://sites.google.com/site/csaijiang/MCSL.
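The abstract describes MCSL only as an extension of LGC that smooths predictions over both the PPI network and the functional class network. The following is a minimal sketch of that idea, not the authors' formulation. `lgc` implements the standard LGC closed form F = (I - alpha*S)^{-1} Y on a symmetrically normalized graph. `two_graph_propagation` is a hypothetical illustration of how a class graph could be coupled with the protein graph. The function names, the update rule in the second function, the value of `alpha`, and the toy data are all assumptions introduced here.

```python
import numpy as np


def normalize_affinity(W):
    """Return S = D^{-1/2} W D^{-1/2}; nodes with zero degree get zero rows."""
    W = np.asarray(W, dtype=float)
    degree = W.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(degree[nonzero])
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]


def lgc(W, Y, alpha=0.9):
    """Standard LGC scores, F = (I - alpha * S)^{-1} Y.

    W: (n, n) symmetric protein affinity matrix (e.g. PPI adjacency).
    Y: (n, k) label matrix; unlabeled proteins have all-zero rows.
    Each of the k classes is scored independently of the others.
    """
    S = normalize_affinity(W)
    n = S.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, np.asarray(Y, dtype=float))


def two_graph_propagation(W_ppi, W_class, Y, alpha=0.9, n_iter=200):
    """Hypothetical coupling of protein and class graphs (an assumption,
    not the published MCSL update): F <- alpha * S_ppi F S_class + (1 - alpha) Y.

    W_class: (k, k) symmetric class similarity matrix; it should carry
    self-similarity on its diagonal so a class keeps its own evidence.
    """
    S_p = normalize_affinity(W_ppi)
    S_c = normalize_affinity(W_class)
    Y = np.asarray(Y, dtype=float)
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S_p @ F @ S_c + (1 - alpha) * Y
    return F


# Toy data (assumed): four proteins on a path 0-1-2-3, two functional classes.
W_ppi = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
Y = np.array([[1, 0],   # protein 0 is annotated with class 0
              [0, 0],   # proteins 1 and 2 are unlabeled
              [0, 0],
              [0, 1]],  # protein 3 is annotated with class 1
             dtype=float)
W_class = np.array([[1.0, 0.5],
                    [0.5, 1.0]])

F_lgc = lgc(W_ppi, Y)
F_two = two_graph_propagation(W_ppi, W_class, Y)
print(F_lgc.argmax(axis=1))  # predicted class per protein under plain LGC
print(F_two.round(3))        # scores when class similarity is also propagated
```

In plain LGC each column of `Y` is propagated separately, which is the "each functional class in isolation" behavior the abstract criticizes. Multiplying by a class-similarity matrix on the right lets evidence for one class contribute to related classes, which is the general effect interclass consistency is meant to achieve.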