Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses.

Affiliation

Department of Biochemistry and Molecular Biology, University of Dhaka, Dhaka, Bangladesh.

Publication information

PLoS One. 2020 Sep 18;15(9):e0239381. doi: 10.1371/journal.pone.0239381. eCollection 2020.

Abstract

High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data have become paramount for facilitating biological sciences. Genomes of viruses can be radically different from those of all other life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes and, owing to the intrinsic genomic complexity of viruses, RNA viruses have been completely overlooked. Here, we present a novel short k-mer-based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near-perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data.
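
The abstract does not spell out how the short k-mer scores are computed, so the following is only a minimal sketch of a generic k-mer abundance profile, not the authors' scoring method. It assumes plain relative-frequency normalization over the RNA alphabet; the function name `kmer_profile`, the default `k=3`, and the choice to skip windows containing ambiguous bases are all illustrative assumptions.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=3, alphabet="ACGU"):
    """Return relative abundances of all k-mers over `alphabet`.

    Hypothetical helper: the output is a fixed-length list ordered
    lexicographically by k-mer, suitable as a feature vector.
    """
    # Treat DNA-style input (cDNA, transcripts) as RNA.
    seq = seq.upper().replace("T", "U")

    # Enumerate every possible k-mer so the vector length is always len(alphabet) ** k.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]

    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Assumption: windows with characters outside the alphabet (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)

    return [counts[m] / total for m in kmers]
```

A vector such as `kmer_profile(contig, k=3)` has 64 entries regardless of contig length, which is what makes short k-mer profiles convenient inputs for standard classifiers.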

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c33b/7500682/cfd335fb4ec6/pone.0239381.g001.jpg
