Tao Ying, Sam Lee, Li Jianrong, Friedman Carol, Lussier Yves A
Department of Biomedical Informatics, Columbia University, 622 West 168th Street, VC5, New York, NY 10032, USA.
Bioinformatics. 2007 Jul 1;23(13):i529-38. doi: 10.1093/bioinformatics/btm195.
Despite advances in the gene annotation process, the functions of a large portion of gene products remain insufficiently characterized. In addition, the in silico prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomic approaches. To our knowledge, no prediction method has been demonstrated to be highly accurate for sparsely annotated GO terms (those associated with fewer than 10 genes).
We propose a novel approach, information theory-based semantic similarity (ITSS), to automatically predict molecular functions of genes based on existing GO annotations. Using 10-fold cross-validation, we demonstrate that the ITSS algorithm obtains prediction accuracies (precision 97%, recall 77%) comparable to those of other machine learning algorithms when evaluated under similar conditions over densely annotated portions of the GO datasets. This method is able to generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms have failed. As a result, our technique generates an order of magnitude more functional predictions than previous methods. A 10-fold cross-validation demonstrated a precision of 90% at a recall of 36% for the algorithm over sparsely annotated networks of the recent GO annotations (about 1400 GO terms and 11,000 genes in Homo sapiens). To our knowledge, this article presents the first historical rollback validation for predicted GO annotations, which may represent more realistic conditions than the more widely used cross-validation approaches. By manually assessing a random sample of 100 predictions made in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43-58%) can be achieved for the human GO Annotation file dated 2003.
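The abstract does not spell out the ITSS scoring formula, so the following is only a minimal sketch of the general idea behind information-theoretic semantic similarity, using the well-known Resnik-style measure as a stand-in. The toy ontology, gene names, and the functions `ancestors`, `information_content`, and `resnik_similarity` are all hypothetical and chosen for illustration; they should not be read as the authors' algorithm.

```python
import math

# Hypothetical toy ontology (child -> parents); not real GO identifiers.
PARENTS = {
    "molecular_function": set(),
    "binding": {"molecular_function"},
    "catalytic_activity": {"molecular_function"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
    "kinase_activity": {"catalytic_activity"},
}

# Hypothetical direct gene-to-term annotations.
ANNOTATIONS = {
    "geneA": {"dna_binding"},
    "geneB": {"rna_binding"},
    "geneC": {"kinase_activity"},
    "geneD": {"dna_binding", "kinase_activity"},
}


def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(parents, annotations):
    """IC(t) = -log p(t), where p(t) is the fraction of genes annotated
    to t directly or through any descendant of t."""
    n_genes = len(annotations)
    counts = {term: 0 for term in parents}
    for terms in annotations.values():
        covered = set().union(*(ancestors(t, parents) for t in terms))
        for term in covered:
            counts[term] += 1
    return {t: -math.log(c / n_genes) for t, c in counts.items() if c > 0}


def resnik_similarity(t1, t2, parents, ic):
    """Similarity = IC of the most informative common ancestor."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max(ic[t] for t in common if t in ic)


ic = information_content(PARENTS, ANNOTATIONS)
sim_siblings = resnik_similarity("dna_binding", "rna_binding", PARENTS, ic)
sim_distant = resnik_similarity("dna_binding", "kinase_activity", PARENTS, ic)
print(sim_siblings, sim_distant)
```

In this toy setting, two terms that share a specific parent score higher than two terms whose only common ancestor is the root, since the root is annotated to every gene and therefore carries no information. Rarely used terms receive high information content, which is one plausible reason an information-theoretic approach might remain usable on sparsely annotated parts of an ontology.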
The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset and other supplementary information are available at http://phenos.bsd.uchicago.edu/ITSS/.
Supplementary data are available at Bioinformatics online.