Lehtinen Sonja, Lees Jon, Bähler Jürg, Shawe-Taylor John, Orengo Christine
CoMPLEX, University College London, London, United Kingdom; Institute of Structural and Molecular Biology, University College London, London, United Kingdom.
Institute of Structural and Molecular Biology, University College London, London, United Kingdom.
PLoS One. 2015 Aug 19;10(8):e0134668. doi: 10.1371/journal.pone.0134668. eCollection 2015.
With the growing availability of large-scale biological datasets, automated methods for extracting functionally meaningful information from these data are becoming increasingly important. Data on functional association between genes or proteins, such as co-expression, are often represented as gene or protein networks. Several methods for predicting gene function from these networks have been proposed. However, evaluating the relative performance of these algorithms may not be trivial: concerns have been raised over biases in different benchmarking methods and datasets, particularly relating to the non-independence of functional association data and test data. In this paper we propose Compass, a new network-based gene function prediction algorithm that uses a commute-time kernel and partial least squares regression. We compare Compass to GeneMANIA, a leading network-based prediction algorithm, on a number of different benchmarks and find that Compass outperforms GeneMANIA on these benchmarks. We also explicitly explore problems associated with the non-independence of functional association data and test data. We find that a benchmark based on the Gene Ontology database, which directly or indirectly incorporates information from other databases, may considerably overestimate the performance of algorithms that exploit functional association data for prediction.
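The abstract names a commute-time kernel without defining it. As a rough illustration only, the sketch below computes the standard commute-time kernel of a small undirected network, that is, the pseudoinverse of the graph Laplacian, in plain Python. It is not the Compass implementation: the function names are hypothetical, the graph is assumed to be connected and free of self-loops, and the partial least squares regression step is not shown.

```python
def laplacian(adj):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix (no self-loops)."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; is the graph connected?")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def commute_time_kernel(adj):
    """Pseudoinverse of the Laplacian, valid for a connected graph.

    Uses the identity L+ = (L + J/n)^-1 - J/n, where J is the all-ones matrix.
    """
    n = len(adj)
    lap = laplacian(adj)
    shifted = [[lap[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    inv = invert(shifted)
    return [[inv[i][j] - 1.0 / n for j in range(n)] for i in range(n)]


def commute_time(adj, kernel, i, j):
    """Expected round-trip random-walk time between nodes i and j."""
    volume = sum(sum(row) for row in adj)  # sum of degrees
    return volume * (kernel[i][i] + kernel[j][j] - 2.0 * kernel[i][j])


# Toy example: a three-gene path network 0 - 1 - 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
kernel = commute_time_kernel(adj)
print(commute_time(adj, kernel, 0, 2))
```

For this toy path network the commute time between the two end nodes is approximately 8 (graph volume 4 times effective resistance 2). Genes with a short commute time are close in the network, which is the kind of similarity a kernel-based predictor could exploit.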