Accurate annotation of protein coding sequences with IDTAXA.

Authors

Cooley Nicholas P, Wright Erik S

Affiliation

Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA.

Publication information

NAR Genom Bioinform. 2021 Sep 16;3(3):lqab080. doi: 10.1093/nargab/lqab080. eCollection 2021 Sep.

Abstract

The observed diversity of protein coding sequences continues to increase far more rapidly than knowledge of their functions, making classification algorithms essential for assigning a function to proteins using only their sequence. Most pipelines for annotating proteins rely on searches for homologous sequences in databases of previously annotated proteins using BLAST or HMMER. Here, we develop a new approach for classifying proteins into a taxonomy of functions and demonstrate its utility for genome annotation. Our algorithm, IDTAXA, was more accurate than BLAST or HMMER at assigning sequences to KEGG ortholog groups. Moreover, IDTAXA correctly avoided classifying sequences with novel functions to existing groups, which is a common error mode for classification approaches that rely on E-values as a proxy for confidence. We demonstrate IDTAXA's utility for annotating eukaryotic and prokaryotic genomes by assigning functions to proteins within a multi-level ontology, and we apply IDTAXA to detect contamination in eukaryotic genomes. Finally, we re-annotated 8604 microbial genomes with known antibiotic resistance phenotypes to discover two novel associations between proteins and antibiotic resistance. IDTAXA is available as a web tool (http://DECIPHER.codes/Classification.html) or as part of the open-source DECIPHER R package from Bioconductor.
