
Alignment-Free Viral Sequence Classification at Scale.

Authors

van Zyl Daniel J, Dunaiski Marcel, Tegally Houriiyah, Baxter Cheryl, de Oliveira Tulio, Xavier Joicymara S

Affiliations

Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa.

Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa.

Publication information

bioRxiv. 2024 Dec 11:2024.12.10.627186. doi: 10.1101/2024.12.10.627186.

DOI:10.1101/2024.12.10.627186
PMID:39713356
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11661207/
Abstract

BACKGROUND

The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-based methods, such as BLAST, are increasingly overwhelmed by the scale of contemporary datasets due to their high computational demands for classification. This study evaluates alignment-free (AF) methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.

RESULTS

We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.
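The abstract does not name the six AF techniques, so the following is only a minimal sketch of one common word-based representation, normalized k-mer frequencies, and not the authors' implementation. The function name `kmer_vector`, the choice of `k`, and the handling of ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_vector(sequence: str, k: int = 3) -> list[float]:
    """Return the normalized k-mer frequency vector of a nucleotide sequence.

    Hypothetical helper: a generic word-based alignment-free feature,
    not necessarily one of the six techniques used in the study.
    """
    alphabet = "ACGT"
    # Fixed, lexicographically ordered vocabulary of all 4**k words,
    # so every sequence maps to a vector of the same length.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    seq = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (e.g. N) are not in the
    # vocabulary and are therefore ignored (an assumption of this sketch).
    total = sum(counts[word] for word in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]


example = kmer_vector("ACGTACGTNACG", k=2)
```

Vectors of this kind have a fixed length regardless of genome length, which is what allows them to be passed directly to a standard classifier such as a Random Forest without first aligning the sequences.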

CONCLUSION

Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing than alignment-based methods and the ability to classify sequences using modest computational resources.

Similar articles

1
Alignment-Free Viral Sequence Classification at Scale.
bioRxiv. 2024 Dec 11:2024.12.10.627186. doi: 10.1101/2024.12.10.627186.
2
Alignment-free viral sequence classification at scale.
BMC Genomics. 2025 Apr 18;26(1):389. doi: 10.1186/s12864-025-11554-5.
3
ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels.
BMC Genomics. 2019 Apr 3;20(1):267. doi: 10.1186/s12864-019-5571-y.
4
GRAMEP: an alignment-free method based on the maximum entropy principle for identifying SNPs.
BMC Bioinformatics. 2025 Feb 25;26(1):66. doi: 10.1186/s12859-025-06037-z.
5
CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences.
BMC Genomics. 2024 Dec 18;25(1):1214. doi: 10.1186/s12864-024-11135-y.
6
Deepvirusclassifier: a deep learning tool for classifying SARS-CoV-2 based on viral subtypes within the coronaviridae family.
BMC Bioinformatics. 2024 Jul 5;25(1):231. doi: 10.1186/s12859-024-05754-1.
7
A review on advancements in feature selection and feature extraction for high-dimensional NGS data analysis.
Funct Integr Genomics. 2024 Aug 19;24(5):139. doi: 10.1007/s10142-024-01415-x.
8
A new profiling approach for DNA sequences based on the nucleotides' physicochemical features for accurate analysis of SARS-CoV-2 genomes.
BMC Genomics. 2023 May 18;24(1):266. doi: 10.1186/s12864-023-09373-7.
9
Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors.
Biomolecules. 2023 Jun 2;13(6):934. doi: 10.3390/biom13060934.
10
Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study.
PLoS One. 2020 Apr 24;15(4):e0232391. doi: 10.1371/journal.pone.0232391. eCollection 2020.
