Fuhl Wolfgang, Zabel Susanne, Nieselt Kay
University of Tübingen, Institute for Biomedical Informatics (IBMI), Sand 14, Tübingen, Baden-Württemberg, 72076, Germany.
Bioinform Adv. 2023 Jul 17;3(1):vbad092. doi: 10.1093/bioadv/vbad092. eCollection 2023.
Modern high-throughput sequencing technologies, such as metagenomic sequencing, generate millions of sequences that need to be assigned to their taxonomic rank. Current approaches either apply local alignment against existing databases, as in MMseqs2, or use deep neural networks, as in DeepMicrobes and BERTax. Due to the increasing size of datasets and databases, alignment-based approaches are expensive in terms of runtime. Deep learning-based approaches can require specialized hardware and consume large amounts of energy. In this article, we propose to use k-mer profiles of DNA sequences as features for taxonomic classification. Although k-mer profiles have been used before, we were able to significantly increase their predictive power by applying a feature space balancing approach to the training data. This greatly improved the generalization quality of the classifiers. We have implemented different pipelines using our proposed feature extraction and dataset balancing in combination with different simple classifiers, such as bagged decision trees or feature subspace KNNs. By comparing the performance of our pipelines with state-of-the-art algorithms, such as BERTax and MMseqs2, on two different datasets, we show that our pipelines outperform these in almost all classification tasks. In particular, sequences from organisms that were not part of the training set were classified with high precision.
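To illustrate what a k-mer profile is, the following minimal Python sketch counts every overlapping substring of length k in a DNA sequence and normalizes the counts to relative frequencies. It is not the authors' implementation: the function name `kmer_profile`, the default `k = 3`, the normalization to frequencies, and the choice to skip k-mers containing ambiguous bases such as N are all assumptions made for illustration. The feature space balancing step is not shown, because the abstract does not specify it.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous DNA bases are counted


def kmer_profile(sequence: str, k: int = 3) -> list[float]:
    """Return relative frequencies of all 4**k k-mers in lexicographic order.

    Hypothetical helper for illustration; k-mers containing characters
    outside ACGT (for example N) are ignored.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed feature order: AA..A, AA..C, ..., TT..T.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        # Sequence shorter than k, or no valid k-mers at all.
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]
```

A profile computed this way has a fixed length of 4^k regardless of sequence length, so it can be passed as a feature vector to simple classifiers such as the bagged decision trees or KNNs mentioned above.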
The open-source code and the code to reproduce the results are available in Seafile, at https://tinyurl.com/ysk47fmr.
Supplementary data are available at Bioinformatics Advances online.