Chimeric mis-annotations of genes remain pervasive in eukaryotic non-model organisms.

Author information

Bachler Andreas, Walsh Thomas K, Rane Rahul V, Pandey Gunjan

Affiliations

CSIRO, Black Mountain Laboratories, Clunies Ross Street, Canberra, ACT, 2601, Australia.

CSIRO, 351 Royal Parade, Parkville, VIC, 3052, Australia.

Publication information

BMC Genomics. 2025 Jul 1;26(1):630. doi: 10.1186/s12864-025-11765-w.

Abstract

BACKGROUND

Accurate annotation of protein-coding genes is critical for genome analysis in non-model organisms. However, limited RNA-Seq data and incomplete protein resources can lead to errors, including chimeric gene mis-annotations, where two or more adjacent genes are incorrectly fused into a single model. These errors often persist due to annotation inertia, in which mistakes are propagated and amplified through data sharing and reanalysis, leading to cases where the mis-annotated model becomes favoured over the correct one. This complicates almost all downstream genomic analyses, such as gene expression studies and comparative genomics.

RESULTS

We investigated chimeric mis-annotations across 30 recently annotated genomes spanning invertebrates, vertebrates, and plants, identifying 605 confirmed cases. The majority of these errors occurred in invertebrates and plants. Using structural prediction and splicing assessment, we demonstrated that machine-learning annotation tools such as Helixer can identify these mis-annotations.
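
To make the idea of a chimeric model concrete, the following is a minimal sketch of one way candidate fusions could be flagged: a reference gene model is reported when two or more non-overlapping genes from an independent prediction set (for example, Helixer output) fall largely inside it. This is an illustrative heuristic and not the authors' pipeline; the `Gene` record, the `candidate_chimeras` function, and the `min_frac` threshold are all hypothetical names and values chosen for the example, and any hit would still need the structural and splicing checks described above.

```python
from collections import namedtuple

# Hypothetical minimal gene record; coordinates are 1-based and inclusive, as in GFF3.
Gene = namedtuple("Gene", ["gene_id", "seqid", "strand", "start", "end"])


def overlap(a, b):
    """Overlapping bases between two genes on the same sequence and strand."""
    if a.seqid != b.seqid or a.strand != b.strand:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def candidate_chimeras(reference, predicted, min_frac=0.8):
    """Map each reference gene ID to the predicted gene IDs it appears to fuse.

    A reference model is flagged when at least two mutually non-overlapping
    predicted genes each have >= min_frac of their length inside it.
    The 0.8 default is an arbitrary illustrative threshold.
    """
    flagged = {}
    ordered = sorted(predicted, key=lambda g: g.start)
    for ref in reference:
        inside = []
        for pred in ordered:
            pred_len = pred.end - pred.start + 1
            if overlap(ref, pred) / pred_len >= min_frac:
                # Keep the prediction only if it starts after the last one kept.
                if not inside or pred.start > inside[-1].end:
                    inside.append(pred)
        if len(inside) >= 2:
            flagged[ref.gene_id] = [p.gene_id for p in inside]
    return flagged


# Toy data: refA spans two separate predicted genes, refB matches one.
reference = [
    Gene("refA", "chr1", "+", 100, 5000),
    Gene("refB", "chr1", "+", 6000, 7000),
]
predicted = [
    Gene("h1", "chr1", "+", 120, 2000),
    Gene("h2", "chr1", "+", 2500, 4900),
    Gene("h3", "chr1", "+", 6000, 7000),
]
print(candidate_chimeras(reference, predicted))
```

In this toy input only `refA` is reported, because it contains both `h1` and `h2`. Coordinate overlap alone cannot distinguish a true fusion from a prediction that wrongly splits a real gene, which is why such candidates are a starting point for review, not a result.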

CONCLUSIONS

This study highlights the prevalence of chimeric mis-annotations in genomic datasets and showcases the potential of machine-learning tools such as Helixer to refine gene models for highly variable gene families whose database entries contain mis-annotations. By addressing these annotation errors, we improve genomic data reliability and facilitate a deeper understanding of non-model organisms.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1123/12220653/75cae8b4d428/12864_2025_11765_Fig1_HTML.jpg
