Prediction of virus-host infectious association by supervised learning methods.

Authors

Zhang Mengge, Yang Lianping, Ren Jie, Ahlgren Nathan A, Fuhrman Jed A, Sun Fengzhu

Affiliations

Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA.

College of Sciences, Northeastern University, Shenyang, China.

Publication details

BMC Bioinformatics. 2017 Mar 14;18(Suppl 3):60. doi: 10.1186/s12859-017-1473-7.

DOI:10.1186/s12859-017-1473-7
PMID:28361670
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5374558/
Abstract

BACKGROUND

The study of virus-host infectious association is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relatively simple methods based on the similarity between the word frequency vectors of viruses and bacterial hosts have been developed to study virus-host associations, the problem is significantly understudied. We hypothesize that machine learning methods based on word frequencies can be efficiently used to study virus-host infectious associations.

METHODS

We investigate four representations of the word frequencies of viral sequences: the relative word frequency and three normalized word frequencies obtained by subtracting the expected word counts from the observed word counts. We also study five machine learning methods (logistic regression, support vector machine, random forest, Gaussian naive Bayes and Bernoulli naive Bayes) for separating infectious from non-infectious viruses for nine bacterial host genera, each with at least 45 infecting viruses. The area under the receiver operating characteristic curve (AUC) is used to compare the performance of different combinations of machine learning methods and features. We then evaluate the performance of the best method for identifying the hosts of contigs in metagenomic studies. We also develop a maximum likelihood method to estimate the fraction of truly infectious viruses for a given host in viral tagging experiments.
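The word-frequency features described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the normalized variant assumes an i.i.d. base-composition background model for the expected counts, which is only one possible choice (the abstract does not say which three background models were used).

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping words of length k, skipping words with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(c in ALPHABET for c in word):
            counts[word] += 1
    return counts


def all_words(k):
    """All 4**k words in a fixed lexicographic order (AA, AC, AG, ...)."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def relative_frequencies(seq, k):
    """Relative word frequency vector: each count divided by the total count."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * (4 ** k)
    return [counts[w] / total for w in all_words(k)]


def centered_counts(seq, k):
    """Observed minus expected word counts.

    ASSUMPTION: expected counts come from an i.i.d. model whose base
    probabilities are estimated from the sequence itself.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    base = kmer_counts(seq, 1)
    n_bases = sum(base.values())
    result = []
    for w in all_words(k):
        prob = 1.0
        for c in w:
            prob *= base[c] / n_bases if n_bases else 0.0
        result.append(counts[w] - total * prob)
    return result
```

For a word size of 6, each sequence maps to a vector of 4^6 = 4096 entries, which could then be passed to any standard classifier.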

RESULTS

Based on nine bacterial host genera with at least 45 infectious viruses each, we show that random forest together with the relative word frequency vector performs best in identifying viruses infecting particular hosts. For all nine host genera the AUC is over 0.85, and for five of them the AUC is higher than 0.98, when the word size is 6, indicating the high accuracy of machine learning approaches for identifying viruses infecting particular hosts. We also show that our method can predict the hosts of viral contigs of length at least 1 kbp in metagenomic studies with high accuracy. The random forest together with the word frequency vector outperforms currently available methods based on the Manhattan and $d_2^*$ dissimilarity measures. Based on word frequencies, we estimate that about 95% of the identified T4-like viruses in a viral tagging experiment infect Synechococcus, while only about 29% of the identified non-T4-like viruses and 30% of the contigs in the study potentially infect Synechococcus.
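To make the evaluation criterion concrete, the sketch below computes the AUC directly from its rank interpretation (the probability that a randomly chosen infecting virus scores higher than a randomly chosen non-infecting one) and the Manhattan dissimilarity between two frequency vectors. The function names and toy inputs are illustrative assumptions, not taken from the paper.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Labels are 1 (infecting) or 0 (non-infecting).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def manhattan(u, v):
    """Manhattan (L1) dissimilarity between two equal-length frequency vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))
```

A dissimilarity-based baseline can be scored with the same `auc` function by using the negated dissimilarity between a virus and the host as the score, so that a smaller distance means a higher score.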

CONCLUSIONS

The random forest machine learning method, with relative word frequencies as viral features, can be used to predict viruses and viral contigs for specific bacterial hosts. The maximum likelihood approach can be used to estimate the fraction of viruses that truly infect the host in viral tagging experiments.
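The abstract does not specify the likelihood model behind the maximum likelihood estimate, so the following is only a generic sketch under a stated assumption: each tagged virus has a known likelihood under an "infecting" model and under a "non-infecting" model, and the unknown fraction of infecting viruses is the mixing weight of a two-component mixture. The function name, parameters, and the EM-style iteration are illustrative choices, not the authors' procedure.

```python
def estimate_infecting_fraction(lik_infecting, lik_other, n_iter=200, tol=1e-10):
    """Maximum likelihood estimate of the mixing fraction pi.

    Maximizes sum_i log(pi * f1_i + (1 - pi) * f0_i), where f1_i and f0_i are
    the (assumed known) likelihoods of virus i under the infecting and
    non-infecting models. Uses the standard EM update for a mixing weight.
    """
    if len(lik_infecting) != len(lik_other) or not lik_infecting:
        raise ValueError("inputs must be non-empty and of equal length")
    pi = 0.5  # starting value; the log-likelihood is concave in pi
    for _ in range(n_iter):
        responsibilities = []
        for f1, f0 in zip(lik_infecting, lik_other):
            denom = pi * f1 + (1.0 - pi) * f0
            # If both likelihoods are zero the item is uninformative: keep pi.
            responsibilities.append(pi * f1 / denom if denom > 0 else pi)
        new_pi = sum(responsibilities) / len(responsibilities)
        if abs(new_pi - pi) < tol:
            return new_pi
        pi = new_pi
    return pi
```

With perfectly separating likelihoods the estimate reduces to the observed proportion of viruses assigned to the infecting model; with overlapping likelihoods it weights each virus by how strongly the data favor infection.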

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/23a8/5374558/7ff6d5e59197/12859_2017_1473_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/23a8/5374558/71290100b9e6/12859_2017_1473_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/23a8/5374558/7bf060a0b581/12859_2017_1473_Fig2_HTML.jpg