Department of Genetics, Evolution and Environment, University College London, London, WC1E 6BT, UK.
SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.
Bioinformatics. 2020 Jul 1;36(Suppl_1):i210-i218. doi: 10.1093/bioinformatics/btaa466.
With the ever-increasing number and diversity of sequenced species, the challenge of characterizing genes with functional information is even more important. In most species, this characterization relies almost entirely on automated electronic methods. As such, it is critical to benchmark the various methods. The Critical Assessment of protein Function Annotation algorithms (CAFA) series of community experiments provides the most comprehensive benchmark, with a time-delayed analysis leveraging newly curated experimentally supported annotations. However, the definition of a false positive in CAFA has not fully accounted for the open world assumption (OWA), leading to a systematic underestimation of precision. The main reason for this limitation is the relative paucity of negative experimental annotations.
This article introduces a new, OWA-compliant, benchmark based on a balanced test set of positive and negative annotations. The negative annotations are derived from expert-curated annotations of protein families on phylogenetic trees. This approach results in a large increase in the average information content of negative annotations. The benchmark has been tested using the naïve and BLAST baseline methods, as well as two orthology-based methods. This new benchmark could complement existing ones in future CAFA experiments.
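The effect of the open world assumption on precision can be illustrated with a minimal sketch. It is not the benchmark's actual code: the function names, the protein and term identifiers, and the toy annotation sets are hypothetical, chosen only to show why counting every unannotated prediction as a false positive lowers precision, whereas scoring only predictions with an explicit positive or negative annotation does not.

```python
# Illustrative sketch only; names and data are hypothetical, not taken from the article.

def precision_closed_world(predicted, positives):
    """Precision when every prediction without a positive annotation counts as false."""
    if not predicted:
        return None
    true_pos = len(predicted & positives)
    return true_pos / len(predicted)


def precision_open_world(predicted, positives, negatives):
    """Precision when only explicitly annotated predictions are scored.

    Predictions with neither a positive nor a negative annotation are treated
    as unknown and left out of the calculation.
    """
    true_pos = len(predicted & positives)
    false_pos = len(predicted & negatives)
    if true_pos + false_pos == 0:
        return None
    return true_pos / (true_pos + false_pos)


# Toy (protein, term) pairs; all identifiers are placeholders.
predicted = {
    ("prot1", "term_a"),
    ("prot1", "term_b"),  # no annotation of either kind: unknown under the OWA
    ("prot2", "term_a"),
    ("prot2", "term_c"),
}
positives = {("prot1", "term_a"), ("prot2", "term_a")}
negatives = {("prot2", "term_c")}

closed = precision_closed_world(predicted, positives)
open_world = precision_open_world(predicted, positives, negatives)
print(f"closed-world precision: {closed:.2f}")
print(f"open-world precision:   {open_world:.2f}")
```

In this toy case the unannotated prediction is penalized under the closed-world count but ignored under the open-world count, so the first estimate is lower than the second.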
All data, as well as the code used for analysis, are available from https://lab.dessimoz.org/20_not.
Supplementary data are available at Bioinformatics online.