Pavlidis Paul, Gillis Jesse
Centre for High-Throughput Biology and Department of Psychiatry, University of British Columbia, Vancouver, V6T1Z4, Canada.
Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY, 11797, USA.
F1000Res. 2012 Sep 7;1:14. doi: 10.12688/f1000research.1-14.v1. eCollection 2012.
In this opinion piece, we attempt to unify recent arguments we have made that serious confounds affect the use of network data to predict and characterize gene function. The development of computational approaches to determine gene function is a major strand of computational genomics research. However, progress beyond using BLAST to transfer annotations has been surprisingly slow. We have previously argued that a large part of the reported success in using "guilt by association" in network data is due to the tendency of methods to simply assign new functions to already well-annotated genes. While such predictions will tend to be correct, they are generic; it is true, but not very helpful, that a gene with many functions is more likely to have any function. We have also presented evidence that much of the remaining performance in cross-validation cannot be usefully generalized to new predictions, making progressive improvement in analysis difficult to engineer. Here we summarize our findings about how these problems will affect network analysis, discuss some ongoing responses within the field to these issues, and consolidate some recommendations and speculation, which we hope will modestly increase the reliability and specificity of gene function prediction.
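The multifunctionality argument above can be illustrated with a small simulation. The sketch below is not the authors' analysis and uses no real annotation data: it draws synthetic gene-by-function annotations under the assumption that a few genes are much more heavily annotated than others, then ranks genes for each function purely by how many other functions they already have. The names `simulate_annotations`, `multifunctionality_auroc`, and the Beta-distributed propensity are hypothetical choices made for illustration only.

```python
import random


def auroc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None  # undefined when one class is empty
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate_annotations(n_genes=300, n_functions=40, heterogeneous=True, seed=0):
    """Synthetic binary gene-by-function matrix (an assumption, not real data).

    With heterogeneous=True each gene gets its own annotation propensity,
    so a minority of genes end up with many functions. With
    heterogeneous=False every gene has the same propensity (a control).
    """
    rng = random.Random(seed)
    if heterogeneous:
        propensity = [rng.betavariate(0.5, 4.0) for _ in range(n_genes)]
    else:
        propensity = [0.1] * n_genes
    return [[1 if rng.random() < p else 0 for _ in range(n_functions)]
            for p in propensity]


def multifunctionality_auroc(annotations):
    """Mean AUROC when genes are ranked only by their annotation count.

    For each target function, a gene's score is its number of annotations
    in all OTHER functions, so the target label never leaks into the score.
    """
    n_functions = len(annotations[0])
    totals = [sum(row) for row in annotations]
    aucs = []
    for f in range(n_functions):
        labels = [row[f] for row in annotations]
        scores = [t - y for t, y in zip(totals, labels)]
        value = auroc(scores, labels)
        if value is not None:
            aucs.append(value)
    return sum(aucs) / len(aucs)


if __name__ == "__main__":
    skewed = multifunctionality_auroc(simulate_annotations(heterogeneous=True))
    control = multifunctionality_auroc(simulate_annotations(heterogeneous=False))
    print(f"annotation-count ranking, skewed genes:  {skewed:.2f}")
    print(f"annotation-count ranking, uniform genes: {control:.2f}")
```

Under these assumptions, the single annotation-count ranking scores well above chance for the skewed gene set and near chance for the uniform control, even though it uses no network information and makes the same generic prediction for every function.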