School of Public Health, Xi'an Medical University, Xi'an, 710021, Shaanxi, China.
Sci Rep. 2024 Nov 6;14(1):26992. doi: 10.1038/s41598-024-78493-1.
Despite the end of the global Coronavirus Disease 2019 (COVID-19) pandemic, the risk factors for COVID-19 severity continue to be a pivotal area of research. Specifically, studying the impact of the genomic diversity of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) on COVID-19 severity is crucial for predicting severe outcomes. Therefore, this study aimed to investigate the impact of the SARS-CoV-2 genome sequence, genotype, patient age, gender, and vaccination status on the severity of COVID-19, and to develop accurate and robust prediction models. The training set (n = 12,038), primary testing set (n = 4,006), and secondary testing set (n = 2,845) consist of SARS-CoV-2 genome sequences with patient information, obtained from the Global Initiative on Sharing All Influenza Data (GISAID) and spanning over four years. Four machine learning methods were employed to construct prediction models. Prediction accuracy was improved by extracting SARS-CoV-2 genomic features, optimizing model parameters, and integrating the models into an ensemble. Furthermore, SHapley Additive exPlanations (SHAP) was applied to analyze the interpretability of the model and to identify risk factors, providing insights for the management of severe cases. The proposed ensemble model achieved an F-score of 88.842% and an Area Under the Curve (AUC) of 0.956 on the global testing dataset. In addition to factors such as patient age, gender, and vaccination status, more than 40 amino acid site mutations were identified as having a significant impact on the severity of COVID-19. This work has the potential to facilitate the early identification of COVID-19 patients at high risk of severe illness, thereby helping to reduce the rates of severe cases and mortality.
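To make the reported metrics concrete, the following is a minimal sketch of how several models' outputs can be combined and then scored with an F-score and an AUC. The abstract does not name the four machine learning methods or the integration scheme, so soft voting (averaging predicted probabilities) is an assumption here, as are the function names, the 0.5 decision threshold, and the toy labels and probabilities, which are invented for illustration and are not data from the study.

```python
from typing import List, Sequence


def soft_vote(prob_lists: Sequence[Sequence[float]]) -> List[float]:
    """Average per-sample probabilities across models (assumed ensemble rule)."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F-score (F1): harmonic mean of precision and recall for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive sample outscores a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = severe case, 0 = non-severe case.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.6, 0.4, 0.50, 0.2, 0.1]
model_b = [0.8, 0.7, 0.3, 0.50, 0.3, 0.2]
model_c = [0.7, 0.8, 0.5, 0.35, 0.1, 0.3]

ensemble_scores = soft_vote([model_a, model_b, model_c])
y_pred = [1 if s >= 0.5 else 0 for s in ensemble_scores]  # assumed threshold

ensemble_f1 = f1_score(y_true, y_pred)
ensemble_auc = roc_auc(y_true, ensemble_scores)
print(f"F-score: {ensemble_f1:.3f}, AUC: {ensemble_auc:.3f}")
```

The F-score depends on the chosen decision threshold, whereas the AUC is computed from the ranking of the scores alone, which is why studies of this kind often report both.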