Rogers Mark F, Ben-Hur Asa
Computer Science Department, Colorado State University, Ft. Collins, CO, USA.
Bioinformatics. 2009 May 1;25(9):1173-7. doi: 10.1093/bioinformatics/btp122. Epub 2009 Mar 2.
The biological community's reliance on computational annotations of protein function makes correct assessment of function prediction methods an issue of great importance. Because a large fraction of the annotations in current biological databases are themselves based on computational methods, estimates of the accuracy of function prediction methods can be biased. Predicting an annotation that was derived computationally in the first place is likely easier than predicting one that was derived experimentally, which leads to over-optimistic estimates of classifier performance.
We illustrate this phenomenon in a set of controlled experiments using a nearest neighbor classifier based on PSI-BLAST similarity scores. Our results demonstrate that the source of the Gene Ontology (GO) annotations used to assess a protein function predictor can have a highly significant influence on classifier accuracy: the average accuracy over four species and over GO terms in the biological process namespace increased from 0.72 to 0.87 when the classifier was given access to annotations whose evidence codes indicate a possible computational source, instead of experimentally determined annotations. Slightly smaller increases were observed in the other namespaces. In these comparisons the total number of annotations and their distribution across GO terms were kept the same.
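The nearest neighbor idea described above can be illustrated with a minimal sketch. It assumes pairwise similarity scores have already been computed (for instance from PSI-BLAST output) and scores a GO term for a query protein as the highest similarity to any training protein carrying that term. The function name `nn_score`, the dictionary layout, and the toy scores are hypothetical and are not taken from the paper, whose exact scoring rule may differ.

```python
from typing import Dict, Set, Tuple

# Assumed input layout (not from the paper): similarity scores keyed by
# (query protein, training protein), e.g. parsed from PSI-BLAST output.
Similarity = Dict[Tuple[str, str], float]


def nn_score(query: str, term: str, sims: Similarity,
             annotations: Dict[str, Set[str]]) -> float:
    """Score `term` for `query` as the highest similarity to any
    training protein annotated with that term (0.0 if there is none)."""
    best = 0.0
    for protein, terms in annotations.items():
        # Skip the query itself and proteins that lack the term.
        if protein == query or term not in terms:
            continue
        best = max(best, sims.get((query, protein), 0.0))
    return best


# Toy data, invented for illustration only.
toy_sims: Similarity = {("q", "a"): 0.9, ("q", "b"): 0.4, ("q", "c"): 0.1}
toy_annotations: Dict[str, Set[str]] = {
    "a": {"GO:0006810"},
    "b": {"GO:0006810", "GO:0008152"},
    "c": {"GO:0008152"},
}

for go_term in ("GO:0006810", "GO:0008152"):
    print(go_term, nn_score("q", go_term, toy_sims, toy_annotations))
```

Under this scheme, changing which annotations populate `annotations` (experimental or possibly computational) changes the scores without touching the similarity data, which is the kind of controlled comparison the abstract describes.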
In conclusion, taking into account GO evidence codes is required for reporting accuracy statistics that do not overestimate a model's performance, and is of particular importance for a fair comparison of classifiers that rely on different information sources.
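Taking evidence codes into account amounts, in practice, to partitioning annotations before training or evaluation. The sketch below assumes annotations are available as (protein, GO term, evidence code) triples and treats the GO Consortium's experimental codes (EXP, IDA, IPI, IMP, IGI, IEP) as the experimental set. The grouping of all remaining codes into one "other" bin, the record layout, and the name `split_by_evidence` are assumptions made for illustration; the paper's own grouping of codes may differ.

```python
from typing import Iterable, List, Tuple

# Assumed record layout: (protein ID, GO term, evidence code).
Annotation = Tuple[str, str, str]

# Experimental evidence codes as defined by the GO Consortium.
# Treating every other code as "possibly computational" is a
# simplifying assumption of this sketch.
EXPERIMENTAL_CODES = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})


def split_by_evidence(
    annotations: Iterable[Annotation],
) -> Tuple[List[Annotation], List[Annotation]]:
    """Partition annotations into (experimental, other) by evidence code."""
    experimental: List[Annotation] = []
    other: List[Annotation] = []
    for record in annotations:
        if record[2] in EXPERIMENTAL_CODES:
            experimental.append(record)
        else:
            other.append(record)
    return experimental, other


# Toy records, invented for illustration only.
toy_records: List[Annotation] = [
    ("P1", "GO:0006810", "IDA"),
    ("P2", "GO:0006810", "IEA"),
    ("P3", "GO:0008152", "ISS"),
    ("P4", "GO:0008152", "IMP"),
]

experimental_set, other_set = split_by_evidence(toy_records)
print(len(experimental_set), "experimental;", len(other_set), "other")
```

Reporting accuracy separately on the two partitions, rather than on their union, is one way to keep computationally derived annotations from inflating the estimate.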
Supplementary data are available at Bioinformatics online.