AntiFam: a tool to help identify spurious ORFs in protein annotation.

Affiliation

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SA. UK.

Publication information

Database (Oxford). 2012 Mar 20;2012:bas003. doi: 10.1093/database/bas003. Print 2012.

DOI:10.1093/database/bas003
PMID:22434837
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3308159/
Abstract

As the deluge of genomic DNA sequence grows the fraction of protein sequences that have been manually curated falls. In turn, as the number of laboratories with the ability to sequence genomes in a high-throughput manner grows, the informatics capability of those labs to accurately identify and annotate all genes within a genome may often be lacking. These issues have led to fears about transitive annotation errors making sequence databases less reliable. During the lifetime of the Pfam protein families database a number of protein families have been built, which were later identified as composed solely of spurious open reading frames (ORFs) either on the opposite strand or in a different, overlapping reading frame with respect to the true protein-coding or non-coding RNA gene. These families were deleted and are no longer available in Pfam. However, we realized that these may perform a useful function to identify new spurious ORFs. We have collected these families together in AntiFam along with additional custom-made families of spurious ORFs. This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins in a collection of metagenomic sequences. UniProt has adopted AntiFam as a part of the UniProtKB quality control process and will investigate these spurious proteins for exclusion.
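The screening step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the AntiFam families are profile HMMs (as Pfam families are), that a protein set has already been searched against them with HMMER's `hmmsearch`, and that the per-sequence table was saved with `--tblout`. The function name `flag_spurious`, the E-value cutoff, and the sample rows (`seq1`, `FamilyA`, and so on) are all hypothetical.

```python
from collections import defaultdict


def flag_spurious(tblout_text, max_evalue=1e-5):
    """Map each sequence ID to the families that matched it.

    Expects text in HMMER's --tblout layout: whitespace-separated
    columns where field 0 is the target (sequence) name, field 2 is
    the query (profile) name and field 4 is the full-sequence E-value.
    The default cutoff is an arbitrary assumption for illustration.
    """
    hits = defaultdict(list)
    for line in tblout_text.splitlines():
        # Skip blank lines and the '#' comment/header lines HMMER writes.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 5:
            continue  # not a complete hit row
        target, family, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= max_evalue:
            hits[target].append(family)
    return dict(hits)


# Made-up rows in the tblout column order (trailing columns omitted).
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias
seq1  -  FamilyA  -  1.0e-30  105.2  0.1
seq2  -  FamilyB  -  3.0e-12  44.0  0.0
seq3  -  FamilyA  -  0.5  8.1  0.0
seq1  -  FamilyB  -  2.0e-08  30.3  0.2
"""

if __name__ == "__main__":
    for seq_id, families in flag_spurious(SAMPLE_TBLOUT).items():
        print(seq_id, ",".join(families))
```

Sequences returned by such a filter would be candidates for review rather than automatic deletion, in line with the abstract's statement that UniProt will investigate flagged proteins. If the profiles carry curated gathering thresholds, running `hmmsearch` with `--cut_ga` may be preferable to a fixed E-value cutoff.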

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5edb/3308159/aa6247c3ec4b/bas003f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5edb/3308159/5734eda802af/bas003f1.jpg

Similar articles

1
AntiFam: a tool to help identify spurious ORFs in protein annotation.
Database (Oxford). 2012 Mar 20;2012:bas003. doi: 10.1093/database/bas003. Print 2012.
2
Gene Unprediction with Spurio: A tool to identify spurious protein sequences.
F1000Res. 2018 Mar 2;7:261. doi: 10.12688/f1000research.14050.1. eCollection 2018.
3
UniProtKB/Swiss-Prot.
Methods Mol Biol. 2007;406:89-112. doi: 10.1007/978-1-59745-535-0_4.
4
An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics.
Genome Res. 2017 Dec;27(12):2083-2095. doi: 10.1101/gr.218255.116. Epub 2017 Nov 15.
5
Pinstripe: a suite of programs for integrating transcriptomic and proteomic datasets identifies novel proteins and improves differentiation of protein-coding and non-coding genes.
Bioinformatics. 2012 Dec 1;28(23):3042-50. doi: 10.1093/bioinformatics/bts582. Epub 2012 Oct 7.
6
The Pfam protein families database: embracing AI/ML.
Nucleic Acids Res. 2025 Jan 6;53(D1):D523-D534. doi: 10.1093/nar/gkae997.
7
The Pfam protein families database: towards a more sustainable future.
Nucleic Acids Res. 2016 Jan 4;44(D1):D279-85. doi: 10.1093/nar/gkv1344. Epub 2015 Dec 15.
8
[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].
Yi Chuan Xue Bao. 2004 May;31(5):431-43.
9
GENIUS II: a high-throughput database system for linking ORFs in complete genomes to known protein three-dimensional structures.
Bioinformatics. 2004 Mar 1;20(4):596-8. doi: 10.1093/bioinformatics/btg478. Epub 2004 Jan 29.
10
Pfam: the protein families database.
Nucleic Acids Res. 2014 Jan;42(Database issue):D222-30. doi: 10.1093/nar/gkt1223. Epub 2013 Nov 27.

Cited by

1
Metatranscriptomes-based sequence similarity networks uncover genetic signatures within parasitic freshwater microbial eukaryotes.
Microbiome. 2025 Feb 6;13(1):43. doi: 10.1186/s40168-024-02027-0.
2
sORFdb - a database for sORFs, small proteins, and small protein families in bacteria.
BMC Genomics. 2025 Feb 5;26(1):110. doi: 10.1186/s12864-025-11301-w.
3
mettannotator: a comprehensive and scalable Nextflow annotation pipeline for prokaryotic assemblies.
Bioinformatics. 2025 Feb 4;41(2). doi: 10.1093/bioinformatics/btaf037.
4
Revisiting the functional annotation of TriTryp using sequence similarity tools.
Heliyon. 2024 Oct 11;10(20):e39243. doi: 10.1016/j.heliyon.2024.e39243. eCollection 2024 Oct 30.
5
Repeat-induced point mutations driving Parastagonospora nodorum genomic diversity are balanced by selection against non-synonymous mutations.
Commun Biol. 2024 Dec 4;7(1):1614. doi: 10.1038/s42003-024-07327-7.
6
InterPro: the protein sequence classification resource in 2025.
Nucleic Acids Res. 2025 Jan 6;53(D1):D444-D456. doi: 10.1093/nar/gkae1082.
7
A catalog of small proteins from the global microbiome.
Nat Commun. 2024 Aug 31;15(1):7563. doi: 10.1038/s41467-024-51894-6.
8
Large-scale investigation of species-specific orphan genes in the human gut microbiome elucidates their evolutionary origins.
Genome Res. 2024 Jul 23;34(6):888-903. doi: 10.1101/gr.278977.124.
9
Unveiling the microbial realm with VEBA 2.0: a modular bioinformatics suite for end-to-end genome-resolved prokaryotic, (micro)eukaryotic and viral multi-omics from either short- or long-read sequencing.
Nucleic Acids Res. 2024 Aug 12;52(14):e63. doi: 10.1093/nar/gkae528.
10
Discovery of antimicrobial peptides in the global microbiome with machine learning.
Cell. 2024 Jul 11;187(14):3761-3778.e16. doi: 10.1016/j.cell.2024.05.013. Epub 2024 Jun 5.

References

1
The Pfam protein families database.
Nucleic Acids Res. 2012 Jan;40(Database issue):D290-301. doi: 10.1093/nar/gkr1065. Epub 2011 Nov 29.
2
Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies.
Nucleic Acids Res. 2011 Nov 1;39(20):8792-802. doi: 10.1093/nar/gkr576. Epub 2011 Jul 19.
3
UniProt Knowledgebase: a hub of integrated protein data.
Database (Oxford). 2011 Mar 29;2011:bar009. doi: 10.1093/database/bar009. Print 2011.
4
A new generation of homology search tools based on probabilistic inference.
Genome Inform. 2009 Oct;23(1):205-11.
5
Annotation error in public databases: misannotation of molecular function in enzyme superfamilies.
PLoS Comput Biol. 2009 Dec;5(12):e1000605. doi: 10.1371/journal.pcbi.1000605. Epub 2009 Dec 11.
6
The Pfam protein families database.
Nucleic Acids Res. 2010 Jan;38(Database issue):D211-22. doi: 10.1093/nar/gkp985. Epub 2009 Nov 17.
7
Identifying bacterial genes and endosymbiont DNA with Glimmer.
Bioinformatics. 2007 Mar 15;23(6):673-9. doi: 10.1093/bioinformatics/btm009. Epub 2007 Jan 19.
8
Large-scale, multi-genome analysis of alternate open reading frames in bacteria and archaea.
OMICS. 2005 Spring;9(1):91-105. doi: 10.1089/omi.2005.9.91.
9
A combined transmembrane topology and signal peptide prediction method.
J Mol Biol. 2004 May 14;338(5):1027-36. doi: 10.1016/j.jmb.2004.03.016.
10
Errors in genome annotation.
Trends Genet. 1999 Apr;15(4):132-3. doi: 10.1016/s0168-9525(99)01706-0.