Identification and correction of abnormal, incomplete and mispredicted proteins in public databases.

Authors

Nagy Alinda, Hegyi Hédi, Farkas Krisztina, Tordai Hedvig, Kozma Evelin, Bányai László, Patthy László

Affiliation

Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, H-1113 Budapest, Hungary.

Publication

BMC Bioinformatics. 2008 Aug 27;9:353. doi: 10.1186/1471-2105-9-353.

DOI:10.1186/1471-2105-9-353

PMID:18752676

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2542381/

Abstract

BACKGROUND

Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes.
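The five conflict checks listed above can be sketched as simple rules over per-sequence annotations. This is a minimal illustration, not the MisPred implementation: the `ProteinFeatures` class, its field names, the flag labels, and the `min_coverage` threshold of 0.6 are all hypothetical, and rule (i) is simplified to "extracellular domain with neither a signal peptide nor a transmembrane segment". In practice the annotations would come from external predictors (signal peptide, transmembrane topology, domain search, genome mapping), which are not modelled here.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ProteinFeatures:
    """Hypothetical per-sequence annotations; all field names are illustrative."""
    has_signal_peptide: bool = False
    n_transmembrane: int = 0
    # Localization classes of the domains found, e.g. {"extracellular", "nuclear"}
    domain_classes: Set[str] = field(default_factory=set)
    # For each domain hit, fraction of the family's typical length that is present
    domain_coverages: List[float] = field(default_factory=list)
    # Chromosomes of the genes that contribute to the predicted transcript
    gene_chromosomes: Set[str] = field(default_factory=set)


def mispred_flags(p: ProteinFeatures, min_coverage: float = 0.6) -> List[str]:
    """Return the labels of the conflict rules (i)-(v) that the entry violates.

    min_coverage is an assumed cutoff for rule (iv), not a published value.
    """
    flags: List[str] = []
    extracellular = "extracellular" in p.domain_classes
    cytoplasmic = "cytoplasmic" in p.domain_classes
    nuclear = "nuclear" in p.domain_classes

    # (i) localization implied by the domains conflicts with missing sequence signals
    if extracellular and not p.has_signal_peptide and p.n_transmembrane == 0:
        flags.append("i_missing_secretory_signal")
    # (ii) extracellular and cytoplasmic domains but no transmembrane segment
    if extracellular and cytoplasmic and p.n_transmembrane == 0:
        flags.append("ii_missing_transmembrane_segment")
    # (iii) extracellular and nuclear domains in the same protein
    if extracellular and nuclear:
        flags.append("iii_extracellular_with_nuclear")
    # (iv) domain integrity: at least one domain is far shorter than expected
    if any(c < min_coverage for c in p.domain_coverages):
        flags.append("iv_truncated_domain")
    # (v) chimera: contributing genes lie on different chromosomes
    if len(p.gene_chromosomes) > 1:
        flags.append("v_interchromosomal_chimera")
    return flags


if __name__ == "__main__":
    suspect = ProteinFeatures(
        domain_classes={"extracellular", "nuclear"},
        domain_coverages=[0.4],
        gene_chromosomes={"2", "7"},
    )
    print(mispred_flags(suspect))
```

An entry that returns an empty list is not confirmed correct; it only means none of these five conflicts was detected. A flagged entry is a candidate for review, since the conflict may also reflect a genuinely abnormal protein rather than a prediction error.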

RESULTS

Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries.

CONCLUSION

MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.

Similar articles

Identification and correction of abnormal, incomplete and mispredicted proteins in public databases.

BMC Bioinformatics. 2008 Aug 27;9:353. doi: 10.1186/1471-2105-9-353.

FixPred: a resource for correction of erroneous protein sequences.

Database (Oxford). 2014 Apr 4;2014:bau032. doi: 10.1093/database/bau032. Print 2014.

Reassessing domain architecture evolution of metazoan proteins: major impact of gene prediction errors.

Genes (Basel). 2011 Jul 13;2(3):449-501. doi: 10.3390/genes2030449.

MisPred: a resource for identification of erroneous protein sequences in public databases.

Database (Oxford). 2013 Jul 17;2013:bat053. doi: 10.1093/database/bat053. Print 2013.

UniSave: the UniProtKB sequence/annotation version database.

Bioinformatics. 2006 May 15;22(10):1284-5. doi: 10.1093/bioinformatics/btl105. Epub 2006 Mar 21.

Reassessing domain architecture evolution of metazoan proteins: major impact of errors caused by confusing paralogs and epaktologs.

Genes (Basel). 2011 Aug 2;2(3):516-61. doi: 10.3390/genes2030516.

Characterizing gene sets with FuncAssociate.

Bioinformatics. 2003 Dec 12;19(18):2502-4. doi: 10.1093/bioinformatics/btg363.

Mining sequence annotation databanks for association patterns.

Bioinformatics. 2005 Nov 1;21 Suppl 3:iii49-57. doi: 10.1093/bioinformatics/bti1206.

Seq2Struct: a resource for establishing sequence-structure links.

Bioinformatics. 2005 Feb 15;21(4):551-3. doi: 10.1093/bioinformatics/bti049. Epub 2004 Sep 28.

Identification and Correction of Erroneous Protein Sequences in Public Databases.

Methods Mol Biol. 2016;1415:179-92. doi: 10.1007/978-1-4939-3572-7_9.

Cited by

Cooperation of Spaln and Prrn5 for Construction of Gene-Structure-Aware Multiple Sequence Alignment.

Methods Mol Biol. 2021;2231:71-88. doi: 10.1007/978-1-0716-1036-7_5.

Detecting and correcting misclassified sequences in the large-scale public databases.

Bioinformatics. 2020 Sep 15;36(18):4699-4705. doi: 10.1093/bioinformatics/btaa586.

Interactome-Seq: A Protocol for Domainome Library Construction, Validation and Selection by Phage Display and Next Generation Sequencing.

J Vis Exp. 2018 Oct 3(140):56981. doi: 10.3791/56981.

Morphological Stasis and Proteome Innovation in Cephalochordates.

Genes (Basel). 2018 Jul 16;9(7):353. doi: 10.3390/genes9070353.

Improved strategy for the curation and classification of kinases, with broad applicability to other eukaryotic protein groups.

Sci Rep. 2018 May 1;8(1):6808. doi: 10.1038/s41598-018-25020-8.

VirusSeeker, a computational pipeline for virus discovery and virome composition analysis.

Virology. 2017 Mar;503:21-30. doi: 10.1016/j.virol.2017.01.005. Epub 2017 Jan 18.

Advantages of an Improved Rhesus Macaque Genome for Evolutionary Analyses.

PLoS One. 2016 Dec 2;11(12):e0167376. doi: 10.1371/journal.pone.0167376. eCollection 2016.

DASP3: identification of protein sequences belonging to functionally relevant groups.

BMC Bioinformatics. 2016 Nov 11;17(1):458. doi: 10.1186/s12859-016-1295-z.

Putative extremely high rate of proteome innovation in lancelets might be explained by high rate of gene prediction errors.

Sci Rep. 2016 Aug 1;6:30700. doi: 10.1038/srep30700.

Identification and Biochemical Properties of Two New Acetylcholinesterases in the Pond Wolf Spider (Pardosa pseudoannulata).

PLoS One. 2016 Jun 23;11(6):e0158011. doi: 10.1371/journal.pone.0158011. eCollection 2016.

References

Towards defining the nuclear proteome.

Genome Biol. 2008 Jan 23;9(1):R15. doi: 10.1186/gb-2008-9-1-r15.

Steady progress and recent breakthroughs in the accuracy of automated genome annotation.

Nat Rev Genet. 2008 Jan;9(1):62-73. doi: 10.1038/nrg2220.

Genetics. Working the (gene count) numbers: finally, a firm answer?

Science. 2007 May 25;316(5828):1113. doi: 10.1126/science.316.5828.1113a.

Advantages of combined transmembrane topology and signal peptide prediction--the Phobius web server.

Nucleic Acids Res. 2007 Jul;35(Web Server issue):W429-32. doi: 10.1093/nar/gkm256. Epub 2007 May 5.

The implications of alternative splicing in the ENCODE protein complement.

Proc Natl Acad Sci U S A. 2007 Mar 27;104(13):5495-500. doi: 10.1073/pnas.0700800104. Epub 2007 Mar 19.

Tentative mapping of transcription-induced interchromosomal interaction using chimeric EST and mRNA data.

PLoS One. 2007 Feb 28;2(2):e254. doi: 10.1371/journal.pone.0000254.

Long-term trends in evolution of indels in protein sequences.

BMC Evol Biol. 2007 Feb 13;7:19. doi: 10.1186/1471-2148-7-19.

The highly cooperative folding of small naturally occurring proteins is likely the result of natural selection.

Cell. 2007 Feb 9;128(3):613-24. doi: 10.1016/j.cell.2006.12.042.

Database resources of the National Center for Biotechnology Information.

Nucleic Acids Res. 2007 Jan;35(Database issue):D5-12. doi: 10.1093/nar/gkl1031. Epub 2006 Dec 14.

Ensembl 2007.

Nucleic Acids Res. 2007 Jan;35(Database issue):D610-7. doi: 10.1093/nar/gkl996. Epub 2006 Dec 5.
