OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.

Affiliations

Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland.

Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland.

Publication information

Bioinformatics. 2021 Sep 29;37(18):2866-2873. doi: 10.1093/bioinformatics/btab219.

DOI:10.1093/bioinformatics/btab219

PMID:33787851

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8479680/

Abstract

MOTIVATION

Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive.
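The pitfall described above can be sketched with a toy closest-sequence classifier. This is only an illustration, not BLAST or any tool named in the abstract: similarity is approximated by k-mer Jaccard overlap, and the sequences, labels and function names are hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy references (not real hemoglobin sequences); they share
# only a short conserved stretch at the end.
REFERENCES = {
    "alpha": "MVLSPADKTNHGKKV",
    "beta": "MVHLTPEEKSHGKKV",
}


def closest_subfamily(query, references=REFERENCES, k=3):
    """Label the query with the subfamily of its most similar reference."""
    qk = kmers(query, k)
    return max(references, key=lambda name: jaccard(qk, kmers(references[name], k)))


# A hypothetical early-branching query that shares only the conserved stretch.
early_branching = "MSLTKAEWQAHGKKV"
print(closest_subfamily(early_branching))
```

Whatever the query, this classifier can only return "alpha" or "beta"; it has no way to report that a sequence belongs to the family but to neither subfamily, which is the over-specific assignment the abstract warns about.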

RESULTS

Here, we first show that in multiple animal and plant datasets, 18-62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.
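The idea of mapping k-mers to ancestral subfamilies can be sketched as follows. This is a simplified illustration under stated assumptions, not OMAmer's actual algorithm or API: each k-mer is attributed to the most specific subfamily that contains every reference carrying it, and a query descends the hierarchy only while enough of its k-mers support the next level. The hierarchy, sequences, function names and the `min_fraction` threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy: subfamily -> parent (None marks the root family).
PARENT = {"globin": None, "alpha": "globin", "beta": "globin"}

# Hypothetical toy references, labelled with their most specific subfamily.
REFERENCES = {"alpha": ["MVLSPADKTNHGKKV"], "beta": ["MVHLTPEEKSHGKKV"]}


def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def path_to_root(node):
    """List the node and its ancestors, node first and root last."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(a, b):
    """Lowest common ancestor of two nodes in the same family."""
    ancestors = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors:
            return node
    raise ValueError("nodes belong to different families")


def build_index(references, k=3):
    """Map each k-mer to the most specific subfamily covering all its carriers."""
    index = {}
    for subfamily, seqs in references.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                index[kmer] = lca(index[kmer], subfamily) if kmer in index else subfamily
    return index


def assign(query, index, k=3, min_fraction=0.2):
    """Return the deepest subfamily with enough k-mer support, or None."""
    qk = kmers(query, k)
    if not qk:
        return None
    # Each matching k-mer supports its own node and every ancestor of it.
    support = defaultdict(int)
    for kmer in qk:
        if kmer in index:
            for node in path_to_root(index[kmer]):
                support[node] += 1
    roots = [n for n, p in PARENT.items() if p is None]
    node = max(roots, key=lambda n: support[n])
    if support[node] / len(qk) < min_fraction:
        return None  # reject: not enough evidence even at the family level
    while True:
        children = [c for c, p in PARENT.items() if p == node]
        if not children:
            return node
        best = max(children, key=lambda c: support[c])
        if support[best] / len(qk) < min_fraction:
            return node  # stop here rather than over-specify
        node = best


index = build_index(REFERENCES)
print(assign("MVLSPADKTNHGKKV", index))  # query identical to the alpha reference
print(assign("MSLTKAEWQAHGKKV", index))  # shares only the conserved stretch
print(assign("QQQQQQQQQQ", index))       # shares no k-mers with the family
```

In this toy setting, the second query is placed at the family level rather than forced into a subfamily, and the third is rejected outright, mirroring the two behaviours the abstract attributes to the method.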

AVAILABILITY AND IMPLEMENTATION

OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

LocalAli: an evolutionary-based local alignment approach to identify functionally conserved modules in multiple networks.

Bioinformatics. 2015 Feb 1;31(3):363-72. doi: 10.1093/bioinformatics/btu652. Epub 2014 Oct 4.

On the quality of tree-based protein classification.

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

TreeGrafter: phylogenetic tree-based annotation of proteins with Gene Ontology terms and other annotations.

Bioinformatics. 2019 Feb 1;35(3):518-520. doi: 10.1093/bioinformatics/bty625.

Rapid alignment-free phylogenetic identification of metagenomic sequences.

Bioinformatics. 2019 Sep 15;35(18):3303-3312. doi: 10.1093/bioinformatics/btz068.

TreeSAPP: the Tree-based Sensitive and Accurate Phylogenetic Profiler.

Bioinformatics. 2020 Sep 15;36(18):4706-4713. doi: 10.1093/bioinformatics/btaa588.

EPIK: precise and scalable evolutionary placement with informative k-mers.

Bioinformatics. 2023 Dec 1;39(12). doi: 10.1093/bioinformatics/btad692.

DeepNOG: fast and accurate protein orthologous group assignment.

Bioinformatics. 2021 Apr 1;36(22-23):5304-5312. doi: 10.1093/bioinformatics/btaa1051.

LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification.

Bioinformatics. 2018 Oct 1;34(19):3281-3288. doi: 10.1093/bioinformatics/bty349.

Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper.

Mol Biol Evol. 2017 Aug 1;34(8):2115-2122. doi: 10.1093/molbev/msx148.

Cited by

Annotation of protein-coding genes in 49 diatom genomes from the Bacillariophyta clade.

Sci Data. 2025 Jun 11;12(1):985. doi: 10.1038/s41597-025-05306-z.

Feature Architecture-Aware Ortholog Search With fDOG Reveals the Distribution of Plant Cell Wall-Degrading Enzymes Across Life.

Mol Biol Evol. 2025 Jun 4;42(6). doi: 10.1093/molbev/msaf120.

Chromosome-scale genome assembly reveals how repeat elements shape non-coding RNA landscapes active during newt limb regeneration.

Cell Genom. 2025 Feb 12;5(2):100761. doi: 10.1016/j.xgen.2025.100761. Epub 2025 Jan 27.

Orthology inference at scale with FastOMA.

Nat Methods. 2025 Feb;22(2):269-272. doi: 10.1038/s41592-024-02552-8. Epub 2025 Jan 3.

Quest for Orthologs in the Era of Biodiversity Genomics.

Genome Biol Evol. 2024 Oct 9;16(10). doi: 10.1093/gbe/evae224.

Matreex: Compact and Interactive Visualization for Scalable Studies of Large Gene Families.

Genome Biol Evol. 2024 Jun 4;16(6). doi: 10.1093/gbe/evae100.

OMArk, a tool for gene annotation quality control, reveals erroneous gene inference.

Nat Biotechnol. 2025 Jan;43(1):40-41. doi: 10.1038/s41587-024-02155-w.

Quality assessment of gene repertoire annotations with OMArk.

Nat Biotechnol. 2025 Jan;43(1):124-133. doi: 10.1038/s41587-024-02147-w. Epub 2024 Feb 21.

OMA orthology in 2024: improved prokaryote coverage, ancestral and extant GO enrichment, a revamped synteny viewer and more in the OMA Ecosystem.

Nucleic Acids Res. 2024 Jan 5;52(D1):D513-D521. doi: 10.1093/nar/gkad1020.
