Mi Huaiyu, Vandergriff Jody, Campbell Michael, Narechania Apurva, Majoros William, Lewis Suzanna, Thomas Paul D, Ashburner Michael
Protein Informatics, Celera Genomics, Foster City, California 94404, USA.
Genome Res. 2003 Sep;13(9):2118-28. doi: 10.1101/gr.771603.
The functional classification of genes on a genome-wide scale is now in its infancy, and we make a first attempt to assess existing methods and identify sources of error. To this end, we compared two independent efforts for associating proteins with functions, one implemented by FlyBase and the other by PANTHER at Celera Genomics. Both methods make inferences based on sequence similarity and the available experimental evidence. However, they differ considerably in methodology and process. Overall, assuming that the systematic error across the two methods is relatively small, we find the protein-to-function association error rate of both the FlyBase and PANTHER methods to be <2%. The primary source of error for both methods appears to be simple human error. Although homology-based inference can certainly cause errors in annotation, our analysis indicates that the frequency of such errors is relatively small compared with the number of correct inferences. Moreover, these homology errors can be minimized by careful tree-based inference, such as that implemented in PANTHER. Often, functional associations are made by one method and not the other, indicating that one of the greatest challenges lies in improving the completeness of available ontology associations.
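The comparison described in the abstract, where some functional associations are made by both methods and others by only one, can be illustrated with a minimal sketch. This is not the authors' procedure: the protein identifiers, function labels, and the `compare_annotations` helper are hypothetical, and the sketch assumes both annotation sets already use the same function vocabulary, which a real FlyBase-to-PANTHER comparison would first have to establish by mapping between ontologies.

```python
from typing import Dict, Set

# Maps a protein identifier to the set of function terms assigned to it.
Annotations = Dict[str, Set[str]]


def compare_annotations(a: Annotations, b: Annotations) -> Dict[str, int]:
    """Count protein-to-function associations shared by both sets or unique to one."""
    counts = {"shared": 0, "only_a": 0, "only_b": 0}
    for protein in set(a) | set(b):
        terms_a = a.get(protein, set())
        terms_b = b.get(protein, set())
        counts["shared"] += len(terms_a & terms_b)   # made by both methods
        counts["only_a"] += len(terms_a - terms_b)   # made only by method A
        counts["only_b"] += len(terms_b - terms_a)   # made only by method B
    return counts


# Hypothetical toy data; identifiers and terms are invented for illustration.
flybase = {"P1": {"kinase", "transferase"}, "P2": {"receptor"}}
panther = {"P1": {"kinase"}, "P2": {"receptor", "transporter"}, "P3": {"ligase"}}

result = compare_annotations(flybase, panther)
print(result)
```

An association found in only one set is not necessarily an error. As the abstract notes, such cases often reflect incomplete coverage by the other method, so each would need manual review before being counted toward an error rate.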