Can machines learn the mutation signatures of SARS-CoV-2 and enable viral-genotype guided predictive prognosis?

Author information

Tata Consultancy Services Ltd, Pune 411013, India; CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), New Delhi 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India. Electronic address: https://twitter.com/NagpalSun.

Tata Consultancy Services Ltd, Pune 411013, India. Electronic address: https://twitter.com/nishal_pinna.

Publication information

J Mol Biol. 2022 Aug 15;434(15):167684. doi: 10.1016/j.jmb.2022.167684. Epub 2022 Jun 11.

Abstract

MOTIVATION

Continuous emergence of new variants through appearance/accumulation/disappearance of mutations is a hallmark of many viral diseases. SARS-CoV-2 variants in particular have exerted tremendous pressure on the global healthcare system owing to their life-threatening and debilitating implications. The sheer plurality of variants and the huge scale of genomic data have added to the challenges of tracing the mutations/variants and their relationship (if any) to infection severity.

RESULTS

We explored the suitability of virus-genotype guided machine-learning in infection prognosis and identification of features/mutations-of-interest. A total of 199,519 outcome-traced genomes, representing 45,625 nucleotide-mutations, were employed. Among these, post data-cleaning, Low and High severity genomes were classified using an integrated model (employing virus genotype, epitopic-influence and patient-age) with consistently high ROC-AUC (Asia: 0.97 ± 0.01, Europe: 0.94 ± 0.01, N. America: 0.92 ± 0.02, Africa: 0.94 ± 0.07, S. America: 0.93 ± 0.03). Although virus-genotype alone could enable high predictivity (0.97 ± 0.01, 0.89 ± 0.02, 0.86 ± 0.04, 0.95 ± 0.06, 0.9 ± 0.04), the performance was not found to be consistent, and the models for a few geographies displayed significant improvement in predictivity when the influence of age and/or epitope was incorporated with virus-genotype (Wilcoxon p_BH < 0.05). Neither age nor epitopic-influence nor clade information could out-perform the integrated features. A sparse model (6 features), developed using patient-age and epitopic-influence of the mutations, performed reasonably well (>0.87 ± 0.03, 0.91 ± 0.01, 0.87 ± 0.03, 0.84 ± 0.08, 0.89 ± 0.05). High-performance models were employed for inferring the important mutations-of-interest using Shapley Additive exPlanations (SHAP). The changes in HLA interactions of the mutated epitopes of reference SARS-CoV-2 were subsequently probed. Notably, we also describe the significance of a 'temporal-modeling approach' to benchmark the models linked with continuously evolving pathogens. We conclude that while machine learning can play a vital role in identifying relevant mutations and factors driving the severity, caution should be exercised in using the genotypic signatures for predictive prognosis.
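The abstract does not spell out how the 'temporal-modeling approach' was set up. One common reading is that a model is fitted on genomes collected before a cutoff date and scored on genomes collected afterwards, so that the benchmark reflects variants the model has not yet seen. The sketch below illustrates that idea with a date-based split and a plain ROC-AUC calculation. The record fields (`collected`, `severe`, `score`), the cutoff date and the toy values are all hypothetical and are not taken from the paper; `score` stands in for the output of a model that would be fitted on the training portion only.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split records by collection date: earlier ones train, later ones test."""
    train = [r for r in records if r["collected"] < cutoff]
    test = [r for r in records if r["collected"] >= cutoff]
    return train, test


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy records (hypothetical; not data from the study).
# "score" is a placeholder for a severity score predicted by a model
# that was trained only on the pre-cutoff records.
records = [
    {"collected": date(2020, 5, 1), "severe": 0, "score": 0.2},
    {"collected": date(2020, 9, 1), "severe": 1, "score": 0.7},
    {"collected": date(2021, 2, 1), "severe": 0, "score": 0.4},
    {"collected": date(2021, 6, 1), "severe": 1, "score": 0.9},
    {"collected": date(2021, 8, 1), "severe": 1, "score": 0.3},
    {"collected": date(2021, 9, 1), "severe": 0, "score": 0.1},
]

train, test = temporal_split(records, cutoff=date(2021, 1, 1))
auc = roc_auc([r["severe"] for r in test], [r["score"] for r in test])
print(f"train={len(train)} test={len(test)} ROC-AUC={auc:.2f}")
```

Compared with a random split, a date-based split of this kind keeps later-emerging mutations out of the training data, which may give a more conservative estimate of how a genotype-based model would behave on future variants.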

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4c8c/9188262/f42a993a2e2f/ga1_lrg.jpg
