Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses.

Affiliation

Department of Biochemistry and Molecular Biology, University of Dhaka, Dhaka, Bangladesh.

Publication information

PLoS One. 2020 Sep 18;15(9):e0239381. doi: 10.1371/journal.pone.0239381. eCollection 2020.

DOI: 10.1371/journal.pone.0239381

PMID: 32946529

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7500682/

Abstract

High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data have become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes, and owing to the intrinsic genomic complexity of viruses, RNA viruses have been completely overlooked. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near-perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data.
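
The abstract does not spell out the scoring method itself, so the following is only a minimal sketch of the general idea behind a short k-mer abundance profile: counting every overlapping k-mer in a sequence and normalizing the counts into a fixed-length feature vector that a classifier could consume. The function name `kmer_profile`, the default `k`, the U-to-T conversion, and the normalization by total count are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: RNA is mapped onto the DNA alphabet (U -> T)


def kmer_profile(sequence: str, k: int = 3) -> list[float]:
    """Return relative frequencies of all 4**k k-mers in lexicographic order.

    Hypothetical helper for illustration; not the authors' scoring method.
    """
    seq = sequence.upper().replace("U", "T")
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Only windows made purely of A/C/G/T contribute (ambiguous bases are skipped).
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Example: a 16-dimensional profile (4**2 possible 2-mers) for a short RNA string.
profile = kmer_profile("AUGGCUAUGGC", k=2)
```

Because the vector length depends only on `k` and not on sequence length, profiles of contigs of different sizes can be stacked into one feature matrix; the choice of `k` and of normalization would need to follow the paper's methods section.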

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c33b/7500682/cfd335fb4ec6/pone.0239381.g001.jpg

Similar articles

Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses.

PLoS One. 2020 Sep 18;15(9):e0239381. doi: 10.1371/journal.pone.0239381. eCollection 2020.

VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data.

Microbiome. 2017 Jul 6;5(1):69. doi: 10.1186/s40168-017-0283-5.

RNAVirHost: a machine learning-based method for predicting hosts of RNA viruses through viral genomes.

Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae059.

Plasmer: an Accurate and Sensitive Bacterial Plasmid Prediction Tool Based on Machine Learning of Shared k-mers and Genomic Features.

Microbiol Spectr. 2023 Jun 15;11(3):e0464522. doi: 10.1128/spectrum.04645-22. Epub 2023 May 16.

Comparison of different assembly and annotation tools on analysis of simulated viral metagenomic communities in the gut.

BMC Genomics. 2014 Jan 18;15:37. doi: 10.1186/1471-2164-15-37.

ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples.

PLoS One. 2019 Sep 11;14(9):e0222271. doi: 10.1371/journal.pone.0222271. eCollection 2019.

Prediction of viral families and hosts of single-stranded RNA viruses based on K-Mer coding from phylogenetic gene sequences.

Comput Biol Chem. 2024 Oct;112:108114. doi: 10.1016/j.compbiolchem.2024.108114. Epub 2024 May 31.

Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms.

BMC Bioinformatics. 2012 Jul 18;13:170. doi: 10.1186/1471-2105-13-170.

Profile hidden Markov models for the detection of viruses within metagenomic sequence data.

PLoS One. 2014 Aug 20;9(8):e105067. doi: 10.1371/journal.pone.0105067. eCollection 2014.

Predicting host taxonomic information from viral genomes: A comparison of feature representations.

PLoS Comput Biol. 2020 May 26;16(5):e1007894. doi: 10.1371/journal.pcbi.1007894. eCollection 2020 May.

Cited by

Universal orthologs infer deep phylogenies and improve genome quality assessments.

BMC Biol. 2025 Jul 28;23(1):224. doi: 10.1186/s12915-025-02328-2.

Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data.

Genome Biol Evol. 2024 May 2;16(5). doi: 10.1093/gbe/evae102.

Effect of tokenization on transformers for biological sequences.

Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae196.

(, )-mer-a simple statistical feature for sequence classification.

Bioinform Adv. 2023 Jul 11;3(1):vbad088. doi: 10.1093/bioadv/vbad088. eCollection 2023.

Benchmarking Bioinformatic Tools for Amplicon-Based Sequencing of Norovirus.

Appl Environ Microbiol. 2023 Jan 31;89(1):e0152222. doi: 10.1128/aem.01522-22. Epub 2022 Dec 21.

Predicting Tissue-Specific mRNA and Protein Abundance in Maize: A Machine Learning Approach.

Front Artif Intell. 2022 May 26;5:830170. doi: 10.3389/frai.2022.830170. eCollection 2022.

Whole-genome sequencing and gene sharing network analysis powered by machine learning identifies antibiotic resistance sharing between animals, humans and environment in livestock farming.

PLoS Comput Biol. 2022 Mar 25;18(3):e1010018. doi: 10.1371/journal.pcbi.1010018. eCollection 2022 Mar.

Mapping Data to Deep Understanding: Making the Most of the Deluge of SARS-CoV-2 Genome Sequences.

mSystems. 2022 Apr 26;7(2):e0003522. doi: 10.1128/msystems.00035-22. Epub 2022 Mar 21.

High Throughput Sequencing for the Detection and Characterization of RNA Viruses.

Front Microbiol. 2021 Feb 22;12:621719. doi: 10.3389/fmicb.2021.621719. eCollection 2021.

References

Identifying viruses from metagenomic data using deep learning.

Quant Biol. 2020 Mar;8(1):64-77. doi: 10.1007/s40484-019-0187-4.

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding.

Lancet. 2020 Feb 22;395(10224):565-574. doi: 10.1016/S0140-6736(20)30251-8. Epub 2020 Jan 30.

ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples.

PLoS One. 2019 Sep 11;14(9):e0222271. doi: 10.1371/journal.pone.0222271. eCollection 2019.

rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data.

Gigascience. 2019 Sep 1;8(9). doi: 10.1093/gigascience/giz100.

Prophage Hunter: an integrative hunting tool for active prophages.

Nucleic Acids Res. 2019 Jul 2;47(W1):W74-W80. doi: 10.1093/nar/gkz380.

De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers.

Gigascience. 2019 May 1;8(5). doi: 10.1093/gigascience/giz039.

Machine Learning for detection of viral sequences in human metagenomic datasets.

BMC Bioinformatics. 2018 Sep 24;19(1):336. doi: 10.1186/s12859-018-2340-x.

Why are RNA virus mutation rates so damn high?

PLoS Biol. 2018 Aug 13;16(8):e3000003. doi: 10.1371/journal.pbio.3000003. eCollection 2018 Aug.

The evolutionary history of vertebrate RNA viruses.

Nature. 2018 Apr;556(7700):197-202. doi: 10.1038/s41586-018-0012-7. Epub 2018 Apr 4.

Discovering viral genomes in human metagenomic data by predicting unknown protein families.

Sci Rep. 2018 Jan 8;8(1):28. doi: 10.1038/s41598-017-18341-7.
