No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study.

Affiliations

Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth SY23 3PD, UK.

Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK.

Publication information

Bioinformatics. 2022 Feb 7;38(5):1198-1207. doi: 10.1093/bioinformatics/btab827.

Abstract

MOTIVATION

The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach for identifying the strengths and weaknesses of any CDS prediction tool and for choosing the right tool for their analysis.

RESULTS

We present an evaluation framework (ORForise) based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of CDS prediction tools. This makes it possible to identify which tool performs better for specific use-cases. We use this to assess 15 ab initio- and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections, which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations.
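To make the idea of scoring a tool's predictions against a reference annotation concrete, the following is a minimal sketch in Python. It is not the ORForise implementation or its API: the `CDS` type, the `compare_annotations` function, the three metrics shown, and the 75% overlap threshold used to count a gene as detected are all assumptions chosen for illustration, and the coordinates are made up.

```python
from typing import NamedTuple


class CDS(NamedTuple):
    """A coding sequence as 1-based inclusive coordinates plus strand (hypothetical type)."""
    start: int
    stop: int
    strand: str  # "+" or "-"


def compare_annotations(reference, predicted, min_overlap=0.75):
    """Score predicted CDSs against a reference annotation.

    A reference gene counts as 'detected' when a same-strand prediction
    covers at least `min_overlap` of its length (assumed threshold), and
    as an 'exact match' when start, stop and strand are all identical.
    """
    if not reference:
        raise ValueError("reference annotation is empty")

    predicted_set = set(predicted)
    exact = sum(1 for gene in reference if gene in predicted_set)

    detected = 0
    for gene in reference:
        gene_len = gene.stop - gene.start + 1
        for pred in predicted:
            if pred.strand != gene.strand:
                continue
            # Number of shared positions; zero or negative means no overlap.
            overlap = min(gene.stop, pred.stop) - max(gene.start, pred.start) + 1
            if overlap >= min_overlap * gene_len:
                detected += 1
                break

    n = len(reference)
    return {
        "genes_detected": detected,
        "exact_matches": exact,
        "pct_genes_detected": 100.0 * detected / n,
        "pct_exact_matches": 100.0 * exact / n,
        "pct_predictions_vs_genes": 100.0 * len(predicted) / n,
    }


# Illustrative, made-up coordinates (not taken from any real genome).
reference = [CDS(1, 300, "+"), CDS(400, 900, "-"), CDS(1000, 1500, "+")]
predicted = [
    CDS(1, 300, "+"),      # exact match
    CDS(450, 900, "-"),    # correct stop, truncated start: detected, not exact
    CDS(2000, 2300, "+"),  # no reference counterpart
    CDS(1000, 1500, "-"),  # right coordinates, wrong strand: not detected
]

metrics = compare_annotations(reference, predicted)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Running the same comparison for several tools on several genomes gives one row of metrics per tool and genome, which is the kind of table needed to see whether a tool's ranking holds across organisms or changes with the genome being analysed.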

AVAILABILITY AND IMPLEMENTATION

Code and datasets for reproduction and customisation are available at https://github.com/NickJD/ORForise.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.


