Benchmarking of Machine Learning classifiers on plasma proteomic for COVID-19 severity prediction through interpretable artificial intelligence.

Affiliations

Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece.

Publication information

Artif Intell Med. 2023 Mar;137:102490. doi: 10.1016/j.artmed.2023.102490. Epub 2023 Jan 18.

Abstract

The SARS-CoV-2 pandemic highlighted the need for software tools that facilitate patient triage with respect to potential disease severity or even death. In this article, an ensemble of Machine Learning (ML) algorithms is evaluated on predicting the severity of a patient's condition, using plasma proteomics and clinical data as input. An overview of AI-based technical developments that support COVID-19 patient management is presented first. Based on this review, an ensemble of ML algorithms that analyze clinical and biological data (i.e., plasma proteomics) of COVID-19 patients is designed and deployed to evaluate the potential use of AI for early COVID-19 patient triage. The proposed pipeline is evaluated using three publicly available datasets for training and testing. Three ML "tasks" are defined, and several algorithms are tested through a hyperparameter tuning method to identify the highest-performing models. As overfitting is a typical pitfall of such approaches (mainly due to the limited size of the training/validation datasets), a variety of evaluation metrics are used to mitigate this risk. In the evaluation procedure, recall scores ranged from 0.6 to 0.74 and F1-scores from 0.62 to 0.75. The best performance was observed with the Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM) algorithms. Additionally, the input features (proteomics and clinical data) were ranked by their Shapley additive explanation (SHAP) values and evaluated for their prognostic capacity and immuno-biological credence. This "interpretable" approach revealed that the ML models could discern critical COVID-19 cases predominantly based on patients' age and on plasma proteins related to B cell dysfunction, hyper-activation of inflammatory pathways such as Toll-like receptors, and hypo-activation of developmental and immune pathways such as SCF/c-Kit signaling. Finally, the computational workflow was corroborated on an independent dataset, which confirmed both the superiority of MLP and the involvement of the abovementioned predictive biological pathways. Regarding limitations of the presented ML pipeline, the datasets used in this study contain fewer than 1000 observations and a large number of input features, and therefore constitute a high-dimensional low-sample (HDLS) dataset that could be prone to overfitting. An advantage of the proposed pipeline is that it combines biological data (plasma proteomics) with clinical-phenotypic data. Thus, in principle, the presented approach could enable timely patient triage when applied with already trained models. However, larger datasets and further systematic validation are needed to confirm the potential clinical value of this approach. The code is available on GitHub: https://github.com/inab-certh/Predicting-COVID-19-severity-through-interpretable-AI-analysis-of-plasma-proteomics.
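
To make the feature-ranking step more concrete, the sketch below computes exact Shapley values for a single sample and ranks the features by absolute contribution. It is an illustration only and is not taken from the authors' repository: the brute-force `shapley_values` helper, the linear `toy_model`, the feature names, and all numbers are hypothetical. Real SHAP analyses typically rely on approximations, because exact enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample `x` relative to a `baseline` sample.

    Features outside a coalition are replaced by their baseline value.
    Cost is exponential in the number of features, so this is only
    practical for a handful of inputs.
    """
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for a coalition of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


# Hypothetical stand-in for a trained severity model (assumption, not from the paper).
def toy_model(features):
    age, protein_a, protein_b = features
    return 0.03 * age + 0.5 * protein_a - 0.2 * protein_b


feature_names = ["age", "protein_a", "protein_b"]  # illustrative names
patient = [70.0, 1.2, 0.4]    # made-up sample
baseline = [50.0, 1.0, 1.0]   # made-up reference sample

phi = shapley_values(toy_model, patient, baseline)

# Rank features by the magnitude of their contribution, largest first.
ranking = sorted(zip(feature_names, phi), key=lambda pair: abs(pair[1]), reverse=True)
for name, value in ranking:
    print(f"{name}: {value:+.3f}")
```

For a linear model like this one, each Shapley value reduces to the coefficient times the difference between the sample and the baseline, and the values sum to the difference between the two predictions, which gives a quick sanity check on the helper.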

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecb4/9846931/5bd063eb4ca1/ga1_lrg.jpg