Department of Computer Science, University of Copenhagen, Universitetsparken 1, Copenhagen, 2100, Denmark.
The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark.
Nat Commun. 2024 Sep 27;15(1):8357. doi: 10.1038/s41467-024-52771-y.
For taxonomy-based classification of assembled metagenomic contigs, current methods use sequence similarity to identify the most likely taxonomy of each contig. However, in the related field of metagenomic binning, contigs are routinely clustered using information from both the contig sequences and their abundance. We introduce Taxometer, a neural network-based method that improves the annotations and estimates the quality of any taxonomic classifier using contig abundance profiles and tetra-nucleotide frequencies. We apply Taxometer to five short-read CAMI2 datasets and find that it increases the average share of correct species-level contig annotations of the MMSeqs2 tool from 66.6% to 86.2%. Additionally, it reduces the share of wrong species-level annotations in the CAMI2 Rhizosphere dataset by an average of two-fold for Metabuli, Centrifuge, and Kraken2. Furthermore, we use Taxometer for benchmarking taxonomic classifiers on two complex long-read metagenomics datasets where ground truth is not known. Taxometer is available as open-source software and can enhance any taxonomic annotation of metagenomic contigs.
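To make the tetra-nucleotide frequency feature mentioned above concrete, here is a minimal Python sketch that counts overlapping 4-mers in a contig and normalizes them to relative frequencies. It is an illustration only and is not taken from the Taxometer code base: the function name `tetranucleotide_frequencies` is hypothetical, and the plain 256-dimensional representation is an assumption (real tools may merge reverse complements or project the vector to fewer dimensions).

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence: str) -> dict[str, float]:
    """Relative frequency of each of the 256 possible 4-mers in a contig.

    Hypothetical helper for illustration; not Taxometer's own featurization.
    """
    seq = sequence.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=4)]

    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))

    # Keep only windows made purely of A, C, G, T (ambiguous bases such as N are skipped).
    valid = {kmer: counts.get(kmer, 0) for kmer in all_kmers}
    total = sum(valid.values())

    if total == 0:
        # Contig shorter than 4 bases, or no unambiguous window at all.
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: count / total for kmer, count in valid.items()}
```

A vector like this, computed per contig, could be combined with the contig's per-sample abundance values to form the kind of input the abstract describes.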