Bartoszewicz Jakub M, Seidel Anja, Renard Bernhard Y
Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany.
NAR Genom Bioinform. 2021 Feb 1;3(1):lqab004. doi: 10.1093/nargab/lqab004. eCollection 2021 Mar.
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing (NGS) reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can also be applied to models beyond the host prediction task. We propose a new approach for convolutional filter visualization that disentangles the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example the SARS-CoV-2 coronavirus, which was unknown before it caused the COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages that not only enable analysis of NGS datasets without requiring any deep learning skills, but also allow advanced users to easily train and explain new models for genomics.
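To make the idea of a nucleotide-resolution contribution map concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation and does not use their packages. It assumes a toy setting in which a read is one-hot encoded, a single convolutional filter is slid over it, and each nucleotide in the best-matching window is credited with its weight-times-input share of the filter activation. The function names `one_hot` and `filter_contributions` and the example filter are hypothetical and exist only for illustration.

```python
import numpy as np

ALPHABET = "ACGT"  # column order of the one-hot encoding (an assumption of this sketch)


def one_hot(read: str) -> np.ndarray:
    """Encode a read as a (length, 4) matrix; ambiguous bases such as N become all-zero rows."""
    encoded = np.zeros((len(read), 4), dtype=np.float32)
    for i, base in enumerate(read.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            encoded[i, j] = 1.0
    return encoded


def filter_contributions(encoded: np.ndarray, weights: np.ndarray):
    """Slide one filter of shape (width, 4) over the read.

    Returns the activation of every window and a per-nucleotide contribution
    vector for the highest-scoring window (zero everywhere else).
    """
    width = weights.shape[0]
    n_windows = encoded.shape[0] - width + 1
    activations = np.array(
        [np.sum(encoded[s:s + width] * weights) for s in range(n_windows)]
    )
    best = int(np.argmax(activations))
    contributions = np.zeros(encoded.shape[0], dtype=np.float32)
    # Weight-times-input attribution: each position's share of the window activation.
    contributions[best:best + width] = np.sum(
        encoded[best:best + width] * weights, axis=1
    )
    return activations, contributions


if __name__ == "__main__":
    toy_filter = one_hot("TATA")  # hypothetical filter that rewards the motif TATA
    acts, contrib = filter_contributions(one_hot("GGTATACC"), toy_filter)
    print("window activations:", acts)
    print("per-nucleotide contributions:", contrib)
```

In this toy example the best window starts at offset 2, and the nonzero contributions sum to that window's activation. A trained network would combine many learned filters and further layers, so a real attribution method would need to propagate contributions through the whole model.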