A Machine Learning Model for the Prediction of COVID-19 Severity Using RNA-Seq, Clinical, and Co-Morbidity Data.

Author information

Sethi Sahil, Shakyawar Sushil, Reddy Athreya S, Patel Jai Chand, Guda Chittibabu

Affiliations

Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68105, USA.

Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA.

Publication information

Diagnostics (Basel). 2024 Jun 18;14(12):1284. doi: 10.3390/diagnostics14121284.

DOI:10.3390/diagnostics14121284

PMID:38928699

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11202902/

Abstract

This study was motivated by the need to understand SARS-CoV-2 infection at the molecular level and to develop predictive tools for managing COVID-19 severity. Because clinical outcomes vary widely among infected individuals, a reliable machine learning (ML) model for predicting COVID-19 severity is needed. Despite the availability of large-scale genomic and clinical data, previous studies have not effectively used multi-modality data for data-driven prediction of disease severity. Our primary goal was to predict COVID-19 severity using an ML model trained on a combination of patients' gene expression, clinical features, and co-morbidity data. We evaluated several ML algorithms, including Logistic Regression (LR), XGBoost (XG), Naïve Bayes (NB), and Support Vector Machine (SVM), together with feature selection methods, to identify the best-performing model for disease severity prediction. XG was the best classifier, with 95% accuracy and an AUC (Area Under the Curve) of 0.99 for distinguishing severity groups. Additionally, SHAP analysis revealed the features contributing most to prediction, including several genes such as COX14, LAMB2, DOLK, SDCBP2, RHBDL1, and IER3-AS1. Notably, two clinical features, the absolute neutrophil count and Viremia Categories, emerged as top contributors. Integrating multiple data modalities significantly improved the accuracy of disease severity prediction compared with using any single modality. The identified features could serve as biomarkers for COVID-19 prognosis and patient care, helping clinicians optimize treatment strategies and refine clinical decision-making to improve patient outcomes.
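The abstract reports model quality as accuracy and AUC. As a minimal sketch of what those two numbers measure, the Python below computes both from scratch using only the standard library. The function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions; they are not the study's data or code.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative one; ties count as 0.5
    (the Mann-Whitney U formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 1 = severe, 0 = non-severe.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.40, 0.60, 0.35, 0.80, 0.90]  # made-up classifier outputs

# Assumed decision threshold of 0.5 to turn scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"AUC      = {roc_auc(y_true, scores):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes how well the scores rank severe cases above non-severe ones across all thresholds. This is why a model can report an AUC (0.99) that is higher than its accuracy (95%).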
