Improving taxonomic classification with feature space balancing.

Authors

Fuhl Wolfgang, Zabel Susanne, Nieselt Kay

Affiliation

University of Tübingen, Institute for Biomedical Informatics (IBMI), Sand 14, Tübingen, Baden-Württemberg, 72076, Germany.

Publication information

Bioinform Adv. 2023 Jul 17;3(1):vbad092. doi: 10.1093/bioadv/vbad092. eCollection 2023.

DOI: 10.1093/bioadv/vbad092

PMID: 37577265

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10415173/

Abstract

SUMMARY

Modern high-throughput sequencing technologies, such as metagenomic sequencing, generate millions of sequences that need to be assigned to their taxonomic rank. Current approaches either apply local alignment against existing databases, as MMseqs2 does, or use deep neural networks, as in DeepMicrobes and BERTax. Because datasets and databases keep growing, alignment-based approaches are expensive in terms of runtime, while deep learning-based approaches can require specialized hardware and consume large amounts of energy. In this article, we propose to use k-mer profiles of DNA sequences as features for taxonomic classification. Although k-mer profiles have been used before, we were able to significantly increase their predictive power by applying a feature space balancing approach to the training data. This greatly improved the generalization quality of the classifiers. We have implemented different pipelines that combine our proposed feature extraction and dataset balancing with different simple classifiers, such as bagged decision trees or feature subspace KNNs. By comparing the performance of our pipelines with state-of-the-art algorithms, such as BERTax and MMseqs2, on two different datasets, we show that our pipelines outperform these in almost all classification tasks. In particular, sequences from organisms that were not part of the training data were classified with high precision.
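
To make the feature representation concrete, the following is a minimal sketch of how a k-mer profile of a DNA sequence could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function name `kmer_profile`, the default `k=3`, the normalization to relative frequencies, and the choice to skip windows containing ambiguous bases are all hypothetical, since the abstract does not specify them. The feature space balancing step and the classifiers are not covered here.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence: str, k: int = 3) -> list[float]:
    """Return the relative frequency of every DNA k-mer, in lexicographic order.

    Assumption: the profile is normalized to frequencies; the paper may use
    raw counts or a different normalization.
    """
    sequence = sequence.upper()
    # All 4**k possible k-mers over the DNA alphabet define a fixed feature order.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of length k over the sequence and count each k-mer.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: windows with ambiguous bases (e.g. N) are simply ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


profile = kmer_profile("ACGTACGTNACG", k=2)
print(len(profile))  # prints 16, one feature per possible 2-mer
```

Every sequence maps to a vector of the same length (4^k entries) regardless of its own length, which is what allows simple classifiers such as decision trees or nearest-neighbour methods to consume these profiles directly.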

AVAILABILITY AND IMPLEMENTATION

The open-source code and the code to reproduce the results are available on Seafile at https://tinyurl.com/ysk47fmr.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics Advances online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/152e/10415173/6bfc1aa619cc/vbad092f1.jpg

Similar articles

Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks.

Proc Natl Acad Sci U S A. 2022 Aug 30;119(35):e2122636119. doi: 10.1073/pnas.2122636119. Epub 2022 Aug 26.

Higher-order Markov models for metagenomic sequence classification.

Bioinformatics. 2020 Aug 15;36(14):4130-4136. doi: 10.1093/bioinformatics/btaa562.

Fast and sensitive taxonomic assignment to metagenomic contigs.

Bioinformatics. 2021 Sep 29;37(18):3029-3031. doi: 10.1093/bioinformatics/btab184.

Large-scale machine learning for metagenomics sequence classification.

Bioinformatics. 2016 Apr 1;32(7):1023-32. doi: 10.1093/bioinformatics/btv683. Epub 2015 Nov 20.

Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding.

Bioinformatics. 2017 Jul 15;33(14):i92-i101. doi: 10.1093/bioinformatics/btx234.

Genomic style: yet another deep-learning approach to characterize bacterial genome sequences.

Bioinform Adv. 2021 Dec 1;1(1):vbab039. doi: 10.1093/bioadv/vbab039. eCollection 2021.

Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers.

Med Phys. 2018 Jul;45(7):3449-3459. doi: 10.1002/mp.12967. Epub 2018 Jun 13.

MT-MAG: Accurate and interpretable machine learning for complete or partial taxonomic assignments of metagenome-assembled genomes.

PLoS One. 2023 Aug 18;18(8):e0283536. doi: 10.1371/journal.pone.0283536. eCollection 2023.

Deep learning models for bacteria taxonomic classification of metagenomic data.

BMC Bioinformatics. 2018 Jul 9;19(Suppl 7):198. doi: 10.1186/s12859-018-2182-6.

Cited by

PCVR: a pre-trained contextualized visual representation for DNA sequence classification.

BMC Bioinformatics. 2025 May 9;26(1):125. doi: 10.1186/s12859-025-06136-x.

Taxometer: Improving taxonomic classification of metagenomics contigs.

Nat Commun. 2024 Sep 27;15(1):8357. doi: 10.1038/s41467-024-52771-y.

References

Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks.

Proc Natl Acad Sci U S A. 2022 Aug 30;119(35):e2122636119. doi: 10.1073/pnas.2122636119. Epub 2022 Aug 26.

Sensitive protein alignments at tree-of-life scale using DIAMOND.

Nat Methods. 2021 Apr;18(4):366-368. doi: 10.1038/s41592-021-01101-x. Epub 2021 Apr 7.

Fast and sensitive taxonomic assignment to metagenomic contigs.

Bioinformatics. 2021 Sep 29;37(18):3029-3031. doi: 10.1093/bioinformatics/btab184.

DeepMicrobes: taxonomic classification for metagenomics with deep learning.

NAR Genom Bioinform. 2020 Feb 19;2(1):lqaa009. doi: 10.1093/nargab/lqaa009. eCollection 2020 Mar.

Global diversity of microbial communities in marine sediment.

Proc Natl Acad Sci U S A. 2020 Nov 3;117(44):27587-27597. doi: 10.1073/pnas.1919139117. Epub 2020 Oct 19.

Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life.

BMC Bioinformatics. 2020 Sep 21;21(1):412. doi: 10.1186/s12859-020-03744-7.

Improved metagenomic analysis with Kraken 2.

Genome Biol. 2019 Nov 28;20(1):257. doi: 10.1186/s13059-019-1891-0.

Benchmarking Metagenomics Tools for Taxonomic Classification.

Cell. 2019 Aug 8;178(4):779-794. doi: 10.1016/j.cell.2019.07.010.

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.

Nat Biotechnol. 2017 Nov;35(11):1026-1028. doi: 10.1038/nbt.3988. Epub 2017 Oct 16.

Comprehensive benchmarking and ensemble approaches for metagenomic classifiers.

Genome Biol. 2017 Sep 21;18(1):182. doi: 10.1186/s13059-017-1299-7.
