Predicting host taxonomic information from viral genomes: A comparison of feature representations.

Affiliations

MRC-University of Glasgow Centre For Virus Research, Glasgow, United Kingdom.

School of Computing Science, University of Glasgow, Glasgow, United Kingdom.

Publication information

PLoS Comput Biol. 2020 May 26;16(5):e1007894. doi: 10.1371/journal.pcbi.1007894. eCollection 2020 May.

DOI:10.1371/journal.pcbi.1007894
PMID:32453718
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7307784/
Abstract

The rise in metagenomics has led to an exponential growth in virus discovery. However, the majority of these new virus sequences have no assigned host. Current machine learning approaches to predicting virus-host interactions tend to focus on nucleotide features, ignoring other representations of genomic information. Here we investigate the predictive potential of features generated from four different 'levels' of viral genome representation: nucleotide, amino acid, amino acid properties and protein domains. This more fully exploits the biological information present in the virus genomes. Over a hundred and eighty binary datasets for infecting versus non-infecting viruses at all taxonomic ranks of both eukaryote and prokaryote hosts were compiled. The viral genomes were converted into the four different levels of genome representation, and twenty feature sets were generated by extracting k-mer compositions and predicted protein domains. We trained and tested Support Vector Machine (SVM) classifiers to compare the predictive capacity of each of these feature sets for each dataset. Our results show that all levels of genome representation are consistently predictive of host taxonomy and that prediction from k-mer composition improves with increasing k-mer length for all k-mer based features. Using a phylogenetically aware holdout method, we demonstrate that the predictive feature sets contain signals reflecting both the evolutionary relationship between the viruses infecting related hosts and host mimicry. Our results demonstrate that incorporating a range of complementary features, generated purely from virus genome sequences, leads to improved accuracy for a range of virus-host prediction tasks, enabling computational assignment of host taxonomic information.
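
To make the idea of k-mer composition features at different representation levels concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the function names `kmer_composition` and `recode_by_property`, the four-class amino acid grouping in `PROPERTY_CLASSES`, and the choice to skip k-mers containing ambiguous characters are all hypothetical choices made for this example. The abstract does not specify the property alphabet or the normalization that was used.

```python
from collections import Counter
from itertools import product

DNA = "ACGT"

# Hypothetical, coarse physicochemical grouping used only for illustration:
# H = nonpolar, P = polar uncharged, B = basic, A = acidic.
PROPERTY_CLASSES = {
    "H": "AVLIMFWPG",
    "P": "STCYNQ",
    "B": "KRH",
    "A": "DE",
}
RESIDUE_TO_CLASS = {
    residue: label
    for label, residues in PROPERTY_CLASSES.items()
    for residue in residues
}


def kmer_composition(seq, k, alphabet=DNA):
    """Return normalized k-mer frequencies over a fixed alphabet.

    Keys cover every possible k-mer over `alphabet`, so vectors built from
    different sequences share the same dimensions. Windows that contain
    characters outside the alphabet (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def recode_by_property(protein):
    """Map each residue to its property class; unknown residues become 'X'."""
    return "".join(RESIDUE_TO_CLASS.get(r, "X") for r in protein.upper())


if __name__ == "__main__":
    # Nucleotide level: dinucleotide composition of a toy sequence.
    nt_features = kmer_composition("ACGTAC", k=2)
    print(nt_features["AC"], nt_features["CG"])

    # Amino acid property level: recode a toy peptide, then count 1-mers.
    recoded = recode_by_property("MKDE")
    print(recoded, kmer_composition(recoded, k=1, alphabet="HPBA"))
```

Each dictionary can be flattened into a fixed-length vector (one entry per possible k-mer) and passed to a classifier such as an SVM. Note that the feature dimension grows as the alphabet size raised to the power k, which is one practical limit on how far k can be increased.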

