

Multimodal data fusion using sparse canonical correlation analysis and cooperative learning: a COVID-19 cohort study.

Authors

Er Ahmet Gorkem, Ding Daisy Yi, Er Berrin, Uzun Mertcan, Cakmak Mehmet, Sadee Christoph, Durhan Gamze, Ozmen Mustafa Nasuh, Tanriover Mine Durusu, Topeli Arzu, Aydin Son Yesim, Tibshirani Robert, Unal Serhat, Gevaert Olivier

Affiliations

Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA.

Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, 06800, Ankara, Turkey.

Publication

NPJ Digit Med. 2024 May 7;7(1):117. doi: 10.1038/s41746-024-01128-2.

Abstract

Through technological innovations, patient cohorts can be examined from multiple views with high-dimensional, multiscale biomedical data to classify clinical phenotypes and predict outcomes. Here, we present our approach to analyzing multimodal data using unsupervised and supervised sparse linear methods in a COVID-19 patient cohort. This prospective cohort study of 149 adult patients was conducted in a tertiary care academic center. First, we used sparse canonical correlation analysis (CCA) to identify and quantify relationships across different data modalities, including viral genome sequencing, imaging, clinical data, and laboratory results. Then, we used cooperative learning to predict the clinical outcome of COVID-19 patients: intensive care unit (ICU) admission. We show that serum biomarkers representing severe disease and acute phase response correlate with original and wavelet radiomics features in the LLL frequency channel (cor(Xu, Zv) = 0.596, p value < 0.001). Among radiomics features, histogram-based first-order features reporting skewness, kurtosis, and uniformity have the most negative coefficients, whereas entropy-related features have the most positive coefficients. Moreover, unsupervised analysis of clinical data and laboratory results gives insights into distinct clinical phenotypes. Leveraging the availability of global viral genome databases, we demonstrate that the Word2Vec natural language processing model can be used for viral genome encoding. It not only separates major SARS-CoV-2 variants but also preserves the phylogenetic relationships among them. Our quadruple model using Word2Vec encoding achieves better prediction results in the supervised task. The model yields area under the curve (AUC) and accuracy values of 0.87 and 0.77, respectively. Our study illustrates that sparse CCA and cooperative learning are powerful techniques for handling high-dimensional, multimodal data to investigate multivariate associations in unsupervised and supervised tasks.
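
To make the quantity cor(Xu, Zv) reported in the abstract concrete, the following is a minimal sketch of rank-1 sparse CCA on synthetic data. It is not the authors' implementation. It assumes a simplified penalized-matrix-decomposition style update (alternating soft-thresholded power iterations on the cross-covariance matrix, with fixed thresholds instead of tuned L1 constraints). The function names, the thresholds `lam_u` and `lam_v`, and the simulated views `X` and `Z` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink each entry toward zero by lam; entries smaller than lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def unit_norm(a):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    n = np.linalg.norm(a)
    return a / n if n > 0 else a

def sparse_cca_rank1(X, Z, lam_u=0.2, lam_v=0.2, n_iter=100):
    """Illustrative rank-1 sparse CCA (assumed simplification, not the paper's code).

    Returns sparse weight vectors u, v and the sample correlation cor(Xu, Zv).
    """
    # Standardize every column so the cross-product is a correlation matrix.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    C = X.T @ Z / X.shape[0]  # p x q cross-correlation between the two views

    # Initialize v with the leading right singular vector of C.
    v = np.linalg.svd(C, full_matrices=False)[2][0]
    for _ in range(n_iter):
        u = unit_norm(soft_threshold(C @ v, lam_u))    # update weights for X
        v = unit_norm(soft_threshold(C.T @ u, lam_v))  # update weights for Z

    corr = np.corrcoef(X @ u, Z @ v)[0, 1]
    return u, v, corr

# Synthetic two-view data: one shared latent factor drives the first
# 3 columns of X and the first 2 columns of Z; all other columns are noise.
rng = np.random.default_rng(0)
n = 149  # same size as the cohort, used here only to set a realistic scale
latent = rng.normal(size=n)
X = rng.normal(size=(n, 20))
Z = rng.normal(size=(n, 15))
X[:, :3] += latent[:, None]
Z[:, :2] += latent[:, None]

u, v, corr = sparse_cca_rank1(X, Z)
print("nonzero weights in u:", np.flatnonzero(u))
print("nonzero weights in v:", np.flatnonzero(v))
print("cor(Xu, Zv):", round(float(corr), 3))
```

In this sketch the soft-thresholding step is what produces sparsity: features whose cross-correlation with the other view falls below the threshold receive a weight of exactly zero, so the surviving weights indicate which variables drive the association. In practice the thresholds would be tuned, for example by permutation testing or cross-validation.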

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f8df/11076490/8e88664f0b0f/41746_2024_1128_Fig1_HTML.jpg
