Machine Learning for detection of viral sequences in human metagenomic datasets.

Affiliations

Dept. of Laboratory Medicine, Karolinska Institutet, F46, Karolinska University Hospital Huddinge, Stockholm, Sweden.

Institute of Computer Science, University of Tartu, Tartu, Estonia.

Publication information

BMC Bioinformatics. 2018 Sep 24;19(1):336. doi: 10.1186/s12859-018-2340-x.

DOI:10.1186/s12859-018-2340-x
PMID:30249176
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6154907/
Abstract

BACKGROUND

Detection of highly divergent or yet unknown viruses from metagenomics sequencing datasets is a major bioinformatics challenge. When human samples are sequenced, a large proportion of assembled contigs are classified as "unknown", as conventional methods find no similarity to known sequences. We wished to explore whether machine learning algorithms using Relative Synonymous Codon Usage frequency (RSCU) could improve the detection of viral sequences in metagenomic sequencing data.
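To make the RSCU feature concrete, here is a minimal sketch of how it can be computed for a single sequence. It uses the common definition (observed count of a codon divided by the mean count across its synonymous codons) and assumes the input is an in-frame coding sequence read with the standard genetic code. The function name `rscu` and the toy sequence are illustrative, and the abstract does not say how reading frames were chosen or which codons were kept as features, so this is not necessarily the authors' exact procedure.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(seq):
    """Return a dict mapping each of the 64 codons to its RSCU value.

    Assumption: `seq` is an in-frame coding sequence. Codons containing
    characters other than A, C, G, T are ignored. If an amino acid does not
    occur in the sequence, all of its codons get 0.0.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    # Group synonymous codons by the amino acid (or stop signal) they encode.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values


# Toy example: four serine codons (TCG, TCG, TCT, AGC).
example = rscu("TCGTCGTCTAGC")
print(example["TCG"])  # 3.0, i.e. TCG is used three times more than expected
```

An RSCU of 1.0 means a codon is used as often as expected under uniform synonymous usage; values above 1.0 indicate over-representation. The resulting 64-value vector is the kind of per-sequence feature a classifier could be trained on.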

RESULTS

We trained Random Forest and Artificial Neural Network classifiers on metagenomic sequences that had been taxonomically classified as virus or non-virus. The algorithms achieved accuracies well above chance level, with an area under the ROC curve of 0.79. Two codons (TCG and CGC) were found to have particularly strong discriminative capacity.
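For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen virus sequence receives a higher classifier score than a randomly chosen non-virus sequence, with ties counted as one half. The sketch below computes it directly from that definition. The function name `roc_auc` and the labels and scores are made up for illustration and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 1 (virus) or 0 (non-virus).
    scores: classifier scores, higher meaning "more likely virus".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for two virus and two non-virus sequences.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))  # 0.75: three of the four pairs are ranked correctly
```

A value of 0.5 corresponds to chance and 1.0 to perfect separation, so the reported 0.79 means the classifiers rank a virus sequence above a non-virus sequence in roughly four out of five such pairs. The pairwise form is quadratic in the number of sequences; for large datasets a rank-based implementation such as scikit-learn's `roc_auc_score` is the usual choice.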

CONCLUSION

RSCU-based machine learning techniques applied to metagenomic sequencing data can help identify a large number of putative viral sequences and can complement conventional methods for taxonomic classification.
