Southwest University, Beibei and South China University of Technology, Guangzhou.
George Mason University, Fairfax.
IEEE/ACM Trans Comput Biol Bioinform. 2013 Jul-Aug;10(4):1045-57. doi: 10.1109/TCBB.2013.111.
High-throughput experimental techniques produce several kinds of heterogeneous proteomic and genomic data sets. Integrating these heterogeneous data sources is both necessary and promising for the computational annotation of proteins. Some methods transform the data sources into separate kernels or feature representations, combine these kernels linearly (or nonlinearly) into a composite kernel, and then use the composite kernel to build a predictive model that infers the function of proteins. Because a protein can have multiple roles and functions (or labels), multilabel learning methods have also been adapted for protein function prediction. We develop a transductive multilabel classifier (TMC) that predicts multiple functions of proteins using several unlabeled proteins. We also propose a method called transductive multilabel ensemble classifier (TMEC), which integrates the different data sources using an ensemble approach: it trains a graph-based multilabel classifier on each single data source and then combines the predictions of the individual classifiers. We use a directed birelational graph to capture the relationships between pairs of proteins, between pairs of functions, and between proteins and functions. We evaluate the effectiveness of TMC and TMEC at predicting protein functions on three benchmarks. Our approaches perform better than recently proposed protein function prediction methods based on composite and multiple kernels. The code, the data sets used in this paper, and supplemental material are available at https://sites.google.com/site/guoxian85/tmec.
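
The contrast between a composite kernel and a per-source ensemble can be illustrated with a small sketch. This is not the TMC or TMEC algorithm from the paper: as a stand-in for the graph-based classifier it uses a generic closed-form label propagation on an undirected similarity graph rather than the directed birelational graph, and it combines the per-source predictions by a plain average. All function names, the toy kernels, the uniform weights, and the `alpha` value are assumptions made for illustration only.

```python
import numpy as np

def normalize_kernel(K):
    """Symmetric normalization D^-1/2 K D^-1/2 of a nonnegative similarity matrix."""
    K = np.array(K, dtype=float)
    np.fill_diagonal(K, 0.0)          # ignore self-similarity
    d = K.sum(axis=1)
    d[d == 0] = 1.0                   # guard against isolated nodes
    inv_sqrt = 1.0 / np.sqrt(d)
    return K * inv_sqrt[:, None] * inv_sqrt[None, :]

def propagate_labels(K, Y, alpha=0.5):
    """Generic label propagation (a stand-in, not TMC): solve (I - alpha*S) F = Y."""
    S = normalize_kernel(K)
    n = S.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, Y)

def composite_kernel_scores(kernels, Y, weights=None, alpha=0.5):
    """Combine the kernels linearly first, then train one classifier on the result."""
    if weights is None:
        weights = np.full(len(kernels), 1.0 / len(kernels))  # uniform weights (assumed)
    K = sum(w * np.asarray(K_i, dtype=float) for w, K_i in zip(weights, kernels))
    return propagate_labels(K, Y, alpha)

def ensemble_scores(kernels, Y, alpha=0.5):
    """Train one classifier per data source, then average the predictions."""
    return np.mean([propagate_labels(K, Y, alpha) for K in kernels], axis=0)

# Toy data: 4 proteins, 2 functions, 2 hypothetical data sources.
K1 = np.array([[1.0, 0.9, 0.1, 0.0],
               [0.9, 1.0, 0.1, 0.0],
               [0.1, 0.1, 1.0, 0.8],
               [0.0, 0.0, 0.8, 1.0]])
K2 = np.array([[1.0, 0.7, 0.2, 0.1],
               [0.7, 1.0, 0.2, 0.1],
               [0.2, 0.2, 1.0, 0.6],
               [0.1, 0.1, 0.6, 1.0]])

# Protein 0 is annotated with function 0 and protein 3 with function 1;
# proteins 1 and 2 are unlabeled (all-zero rows) and are scored transductively.
Y = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])

composite = composite_kernel_scores([K1, K2], Y)
ensemble = ensemble_scores([K1, K2], Y)
print("composite kernel scores:\n", composite)
print("ensemble scores:\n", ensemble)
```

In both variants each row holds one protein's scores over the two functions, and the unlabeled proteins receive scores through their similarity to labeled ones. The only difference shown here is where the data sources are merged: before training (composite kernel) or after prediction (ensemble).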