Interpretable detection of novel human viruses from genome sequencing data.

Authors

Bartoszewicz Jakub M, Seidel Anja, Renard Bernhard Y

Affiliation

Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany.

Publication information

NAR Genom Bioinform. 2021 Feb 1;3(1):lqab004. doi: 10.1093/nargab/lqab004. eCollection 2021 Mar.

DOI: 10.1093/nargab/lqab004

PMID: 33554119

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7849996/

Abstract

Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
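As a rough illustration of what classifying "directly from next-generation sequencing reads" with convolutional filters involves, the sketch below one-hot encodes a read and slides a single filter along it to get one activation per position window. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the A/C/G/T column order, and the toy filter are all hypothetical, and a real model would learn many filters and stack further layers on top.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order for the one-hot encoding


def one_hot(read: str) -> np.ndarray:
    """Encode a read as a (length, 4) matrix; unknown bases such as N stay all-zero."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(read), 4), dtype=np.float32)
    for pos, base in enumerate(read.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def reverse_complement(encoded: np.ndarray) -> np.ndarray:
    """With A,C,G,T columns, flipping both axes yields the reverse complement."""
    return encoded[::-1, ::-1]


def filter_activations(encoded: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a (width, 4) filter along the read; return one activation per window."""
    width = kernel.shape[0]
    n_windows = encoded.shape[0] - width + 1
    return np.array(
        [float(np.sum(encoded[i:i + width] * kernel)) for i in range(n_windows)]
    )


if __name__ == "__main__":
    read = one_hot("ACGT")
    toy_filter = one_hot("CG")  # hypothetical filter that responds to the motif "CG"
    print(filter_activations(read, toy_filter))
```

Because a read can originate from either DNA strand, the reverse-complement helper shows how the same encoding represents the opposite strand. Per-window activations of this kind are the raw material that filter visualization and nucleotide-resolution maps build on, although the contribution scores described in the abstract are computed differently.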

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e34d/7849996/3fcc9ab7d276/lqab004fig1.jpg
