Department of Computer Science, Wayne State University, 5143 Cass Ave., Detroit, MI 48202, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2010 Jan-Mar;7(1):91-9. doi: 10.1109/TCBB.2008.29.
The correct interpretation of many molecular biology experiments depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are meant to act as repositories for our biological knowledge as we acquire and refine it. Hence, by definition, they are incomplete at any given time. In this paper, we describe a technique that improves on our previous method for predicting novel Gene Ontology (GO) annotations by extracting implicit semantic relationships between genes and functions. In addition to our previous latent semantic indexing approach, this work uses a vector space model and a number of weighting schemes. The technique takes into account the hierarchical structure of the GO and can assign different weights to GO terms situated at different depths. We compare and evaluate the prediction abilities of 15 different weighting schemes: nine were previously used in other problem domains, while six are introduced in this paper. The best weighting scheme was a novel one, n2tn. Of the top 50 functional annotations predicted using this scheme, we found support in the literature for 84 percent, while 6 percent were contradicted by the existing literature. For the remaining 10 percent, we did not find any relevant publications to confirm or contradict the predictions. The n2tn weighting scheme also outperformed the simple binary scheme used in our previous approach.
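As an illustration of the general idea only, the following minimal sketch builds a weighted vector space over GO terms and ranks candidate annotations for a gene. Everything in it is an assumption made for the example: the toy ontology, the gene names, the idf-times-depth weight, and the similarity-weighted voting rule are hypothetical and are not the n2tn scheme or the latent semantic indexing step evaluated in the paper.

```python
import math
from collections import defaultdict

# Hypothetical toy ontology: term -> parent terms (not real GO identifiers).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "kinase": ["catalysis"],
}

# Hypothetical direct annotations: gene -> set of terms.
ANNOTATIONS = {
    "geneA": {"dna_binding", "kinase"},
    "geneB": {"dna_binding", "kinase"},
    "geneC": {"dna_binding"},
    "geneD": {"catalysis"},
}


def ancestors(term):
    """Return the term together with all of its ancestors in the toy hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(term):
    """Length of the longest path from the term up to the root (root = 0)."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def weighted_vectors(annotations):
    """Build cosine-normalized gene vectors.

    Assumed weight (illustrative only): binary local weight, idf global
    weight, multiplied by (1 + depth) so that deeper terms count more.
    """
    # Propagate each annotation to all ancestor terms.
    expanded = {
        gene: set().union(*(ancestors(t) for t in terms))
        for gene, terms in annotations.items()
    }
    n_genes = len(expanded)
    doc_freq = defaultdict(int)
    for terms in expanded.values():
        for t in terms:
            doc_freq[t] += 1
    vectors = {}
    for gene, terms in expanded.items():
        raw = {t: math.log(n_genes / doc_freq[t]) * (1 + depth(t)) for t in terms}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors[gene] = {t: (w / norm if norm else 0.0) for t, w in raw.items()}
    return vectors


def cosine(u, v):
    """Dot product of two sparse vectors that are already unit length."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def predict(gene, vectors):
    """Rank terms the gene lacks by similarity-weighted votes from other genes."""
    scores = defaultdict(float)
    for other, vec in vectors.items():
        if other == gene:
            continue
        sim = cosine(vectors[gene], vec)
        for t, w in vec.items():
            if t not in vectors[gene]:
                scores[t] += sim * w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for term, score in predict("geneC", weighted_vectors(ANNOTATIONS)):
        print(f"{term}\t{score:.3f}")
```

Swapping the expression inside `weighted_vectors` for another local, global, or normalization component is how different weighting schemes could be compared in a setup like this one; the specific components of the 15 schemes are not given in the abstract and are not reproduced here.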