Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes.

Affiliation

Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France.

Publication information

BMC Bioinformatics. 2020 Nov 10;21(1):513. doi: 10.1186/s12859-020-03855-1.

DOI:10.1186/s12859-020-03855-1
PMID:33172385
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7656754/
Abstract

BACKGROUND

Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon-intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses.

RESULTS

We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins.
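The counts quoted in the paragraph above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures stated in the abstract; the variable names are illustrative and are not taken from the paper, and nothing here reproduces the authors' detection method.

```python
# Figures quoted in the RESULTS paragraph of the abstract.
# All variable names are illustrative assumptions, not the authors' code.
errors = {
    "deletion": 44_001,
    "insertion": 27_289,
    "mismatched_segment": 11_015,
}
reported_total = 82_305        # "a total of 82,305 potential errors"
explained_mismatches = 5_446   # mismatches with an identified potential cause

# The three categories should add up to the reported total.
total_errors = sum(errors.values())
print(total_errors == reported_total)

# Share of each error category among all detected errors.
shares = {kind: count / total_errors for kind, count in errors.items()}
for kind, share in shares.items():
    print(f"{kind}: {share:.1%}")

# "Approximately half" of the mismatched segments have an identified cause.
explained_fraction = explained_mismatches / errors["mismatched_segment"]
print(f"explained mismatches: {explained_fraction:.1%}")
```

The sketch deliberately does not estimate a per-protein error rate, because the abstract does not say how many errors fall in each protein.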

CONCLUSIONS

Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon-intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c22/7656754/60acafb2e675/12859_2020_3855_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c22/7656754/5dc17dc452dd/12859_2020_3855_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c22/7656754/aba2f2ef90bd/12859_2020_3855_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c22/7656754/68010d089e14/12859_2020_3855_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c22/7656754/04a8c6caa8a0/12859_2020_3855_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c22/7656754/ba480c8e9a7b/12859_2020_3855_Fig6_HTML.jpg

Similar articles

1
Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes.
BMC Bioinformatics. 2020 Nov 10;21(1):513. doi: 10.1186/s12859-020-03855-1.
2
[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].
Yi Chuan Xue Bao. 2004 May;31(5):431-43.
3
4
OpenProt: a more comprehensive guide to explore eukaryotic coding potential and proteomes.
Nucleic Acids Res. 2019 Jan 8;47(D1):D403-D410. doi: 10.1093/nar/gky936.
5
Identification and Correction of Erroneous Protein Sequences in Public Databases.
Methods Mol Biol. 2016;1415:179-92. doi: 10.1007/978-1-4939-3572-7_9.
6
[Correction of five different types of errors of model REFSEQs appeared in NCBI human gene database only by using two novel human genes C17orf32 and ZNF362].
Yi Chuan Xue Bao. 2004 Apr;31(4):325-34.
7
Scipio: using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species.
BMC Bioinformatics. 2008 Jun 13;9:278. doi: 10.1186/1471-2105-9-278.
8
Comparative genomic analysis of human and chimpanzee indicates a key role for indels in primate evolution.
J Mol Evol. 2006 Nov;63(5):682-90. doi: 10.1007/s00239-006-0045-7. Epub 2006 Oct 29.
9
Re-annotation of genome microbial coding-sequences: finding new genes and inaccurately annotated genes.
BMC Bioinformatics. 2002;3:5. doi: 10.1186/1471-2105-3-5. Epub 2002 Feb 5.
10
Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events.
Front Bioinform. 2023 Apr 20;3:1178926. doi: 10.3389/fbinf.2023.1178926. eCollection 2023.

Cited by

1
Predicting Protein Function in the AI and Big Data Era.
Biochemistry. 2025 Jun 3;64(11):2345-2352. doi: 10.1021/acs.biochem.5c00186. Epub 2025 May 17.
2
Toward a comprehensive profiling of alternative splicing proteoform structures, interactions and functions.
Curr Opin Struct Biol. 2025 Feb;90:102979. doi: 10.1016/j.sbi.2024.102979. Epub 2025 Jan 7.
3
The nature and distribution of putative non-functional alleles suggest only two independent events at the origins of Astyanax mexicanus cavefish populations.
BMC Ecol Evol. 2024 Apr 1;24(1):41. doi: 10.1186/s12862-024-02226-1.
4
Deep proteome coverage advances knowledge of Treponema pallidum protein expression profiles during infection.
Sci Rep. 2023 Oct 25;13(1):18259. doi: 10.1038/s41598-023-45219-8.
5
Welcome to the big leaves: Best practices for improving genome annotation in non-model plant genomes.
Appl Plant Sci. 2023 Aug 8;11(4):e11533. doi: 10.1002/aps3.11533. eCollection 2023 Jul-Aug.
6
Pipeline for transferring annotations between proteins beyond globular domains.
Protein Sci. 2023 Jul;32(7):e4655. doi: 10.1002/pro.4655.
7
Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events.
Front Bioinform. 2023 Apr 20;3:1178926. doi: 10.3389/fbinf.2023.1178926. eCollection 2023.
8
CeGAL: Redefining a Widespread Fungal-Specific Transcription Factor Family Using an In Silico Error-Tracking Approach.
J Fungi (Basel). 2023 Mar 29;9(4):424. doi: 10.3390/jof9040424.
9
Functional characterization of prokaryotic dark matter: the road so far and what lies ahead.
Curr Res Microb Sci. 2022 Aug 7;3:100159. doi: 10.1016/j.crmicr.2022.100159. eCollection 2022.
10
Revised eutherian gene collections.
BMC Genom Data. 2022 Jul 23;23(1):56. doi: 10.1186/s12863-022-01071-9.

References

1
MISTIC: A prediction tool to reveal disease-relevant deleterious missense variants.
PLoS One. 2020 Jul 31;15(7):e0236962. doi: 10.1371/journal.pone.0236962. eCollection 2020.
2
A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms.
BMC Genomics. 2020 Apr 9;21(1):293. doi: 10.1186/s12864-020-6707-9.
3
The Alliance of Genome Resources: Building a Modern Data Ecosystem for Model Organism Databases.
Genetics. 2019 Dec;213(4):1189-1196. doi: 10.1534/genetics.119.302523.
4
OrthoFinder: phylogenetic orthology inference for comparative genomics.
Genome Biol. 2019 Nov 14;20(1):238. doi: 10.1186/s13059-019-1832-y.
5
The neXtProt knowledgebase in 2020: data, tools and usability improvements.
Nucleic Acids Res. 2020 Jan 8;48(D1):D328-D334. doi: 10.1093/nar/gkz995.
6
Ensembl 2020.
Nucleic Acids Res. 2020 Jan 8;48(D1):D682-D688. doi: 10.1093/nar/gkz966.
7
EnTAP: Bringing faster and smarter functional annotation to non-model eukaryotic transcriptomes.
Mol Ecol Resour. 2020 Mar;20(2):591-604. doi: 10.1111/1755-0998.13106. Epub 2019 Dec 31.
8
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.
Nucleic Acids Res. 2019 Dec 2;47(21):10994-11006. doi: 10.1093/nar/gkz841.
9
Measuring the impact of gene prediction on gene loss estimates in Eukaryotes by quantifying falsely inferred absences.
PLoS Comput Biol. 2019 Aug 28;15(8):e1007301. doi: 10.1371/journal.pcbi.1007301. eCollection 2019 Aug.
10
Mammalian Annotation Database for improved annotation and functional classification of Omics datasets from less well-annotated organisms.
Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz086.