Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland.
Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland.
Bioinformatics. 2021 Sep 29;37(18):2866-2873. doi: 10.1093/bioinformatics/btab219.
Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive.
Here, we first show that in multiple animal and plant datasets, 18-62% of assignments by closest sequence are incorrect, typically pointing to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. We show that OMAmer, whilst able to reject non-homologous family-level assignments, provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.
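The idea of mapping k-mers to ancestral subfamilies can be illustrated with a toy sketch. This is not OMAmer's implementation: the hierarchy, the sequences, the k-mer length of 3, the two thresholds and every function name below are assumptions chosen for illustration. Each k-mer is stored at the most specific subfamily containing all reference sequences that carry it, and a query is placed at the deepest node that most of its k-mer evidence supports.

```python
from collections import Counter

# Hypothetical toy hierarchy: child -> parent (None marks the family root).
PARENT = {"Hb": None, "Hb-alpha": "Hb", "Hb-beta": "Hb"}

# Hypothetical reference sequences, one per subfamily.
REFERENCES = [("Hb-alpha", "MVLSGKVNVDAAA"), ("Hb-beta", "MVHLGKVNVDEEE")]


def kmers(seq, k=3):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def lineage(node):
    """Return node followed by its ancestors, most specific first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(a, b):
    """Return the most specific node that is an ancestor of both a and b."""
    ancestors = set(lineage(a))
    for node in lineage(b):
        if node in ancestors:
            return node
    raise ValueError("nodes belong to different families")


def build_index(references, k=3):
    """Map each k-mer to the most specific node covering all its carriers."""
    index = {}
    for subfamily, seq in references:
        for kmer in kmers(seq, k):
            index[kmer] = lca(index[kmer], subfamily) if kmer in index else subfamily
    return index


def assign(query, index, k=3, min_family=0.3, min_child=0.5):
    """Place query at the deepest supported node, or return None to reject it.

    min_family and min_child are illustrative thresholds, not published values.
    """
    query_kmers = kmers(query, k)
    support = Counter()
    for kmer in query_kmers:
        if kmer in index:
            # A k-mer stored at a node is evidence for that node and its ancestors.
            support.update(lineage(index[kmer]))
    roots = [n for n, parent in PARENT.items() if parent is None]
    node = max(roots, key=lambda n: support[n])
    if support[node] == 0 or support[node] < min_family * len(query_kmers):
        return None  # too little family-level evidence: treat as non-homologous
    while True:
        children = [c for c, parent in PARENT.items() if parent == node]
        best = max(children, key=lambda c: support[c], default=None)
        if best is None or support[best] < min_child * support[node]:
            return node  # stop here rather than over-specify
        node = best


def closest(query, references, k=3):
    """Baseline: label of the reference sharing the most k-mers with query."""
    q = kmers(query, k)
    return max(references, key=lambda ref: len(q & kmers(ref[1], k)))[0]


INDEX = build_index(REFERENCES)
```

With these toy inputs, the query `"MTQSGKVNVDWWW"` shares only the core k-mers plus one incidental alpha k-mer, so `closest` labels it `"Hb-alpha"` while `assign` keeps it at the ancestral `"Hb"` node. The alpha-like query `"MVLSGKVNVDAAC"` is assigned to `"Hb-alpha"`, and the unrelated `"PPPPQQQQRRRR"` is rejected with `None`.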
OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.
Supplementary data are available at Bioinformatics online.