Machine Learning for detection of viral sequences in human metagenomic datasets.

Affiliations

Dept. of Laboratory Medicine, Karolinska Institutet, F46, Karolinska University Hospital Huddinge, Stockholm, Sweden.

Institute of Computer Science, University of Tartu, Tartu, Estonia.

Publication information

BMC Bioinformatics. 2018 Sep 24;19(1):336. doi: 10.1186/s12859-018-2340-x.

Abstract

BACKGROUND

Detection of highly divergent or yet unknown viruses from metagenomics sequencing datasets is a major bioinformatics challenge. When human samples are sequenced, a large proportion of assembled contigs are classified as "unknown", as conventional methods find no similarity to known sequences. We wished to explore whether machine learning algorithms using Relative Synonymous Codon Usage frequency (RSCU) could improve the detection of viral sequences in metagenomic sequencing data.
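As an illustration of the feature mentioned above, RSCU is commonly defined as the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The following is a minimal sketch of that standard definition, not the authors' pipeline: the function name `rscu`, the choice of reading frame 0, the use of the standard genetic code, and the handling of stop codons and absent amino acids (value 0.0) are all assumptions made here for illustration.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(sequence):
    """Return a dict mapping each of the 64 codons to its RSCU value.

    Assumptions (not taken from the paper): the sequence is read in frame 0,
    codons containing ambiguous bases are ignored, and codons of an amino acid
    that never occurs in the sequence get the value 0.0.
    """
    seq = sequence.upper().replace("U", "T")
    # Count non-overlapping triplets starting at position 0.
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

    # Group synonymous codons by the amino acid (or stop signal) they encode.
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            # observed count / (family total / family size)
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values


if __name__ == "__main__":
    example = rscu("TCGTCGTCATCT")  # four serine codons: TCG, TCG, TCA, TCT
    print(example["TCG"], example["TCA"], example["TCC"])
```

The resulting 64-value vector (or a subset excluding single-codon amino acids and stop codons) is the kind of per-sequence feature a classifier could be trained on.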

RESULTS

We trained Random Forest and Artificial Neural Network classifiers on metagenomic sequences taxonomically classified into virus and non-virus classes. The algorithms achieved accuracies well beyond chance level, with an area under the ROC curve of 0.79. Two codons (TCG and CGC) were found to have a particularly strong discriminative capacity.
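For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive (here, viral) sequence receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that pairwise definition. The function name `roc_auc` and the toy labels and scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 1 (virus) or 0 (non-virus); scores: classifier outputs,
    where a higher score means "more likely viral".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive outranking a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data for illustration only.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported 0.79 should be read. The pairwise loop is quadratic in the number of sequences, so a rank-based implementation would be preferable for large datasets.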

CONCLUSION

RSCU-based machine learning techniques applied to metagenomic sequencing data can help identify a large number of putative viral sequences and provide an addition to conventional methods for taxonomic classification.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/271e/6154907/3c11b3fb067b/12859_2018_2340_Fig1_HTML.jpg
