Machine Learning-Based Alzheimer's Disease Stage Diagnosis Utilizing Blood Gene Expression and Clinical Data: A Comparative Investigation.

Authors

Sarma Manash, Chatterjee Subarna

Affiliation

Department of Computer Science and Engineering, Faculty of Engineering and Technology, Technology Campus (Peenya Campus), Ramaiah University of Applied Sciences, Bengaluru 560058, India.

Publication information

Diagnostics (Basel). 2025 Jan 17;15(2):211. doi: 10.3390/diagnostics15020211.

DOI:10.3390/diagnostics15020211

PMID:39857095

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11765009/

Abstract

This study presents a comparative analysis of the multistage diagnosis of Alzheimer's disease (AD), including mild cognitive impairment (MCI), using two distinct types of biomarkers: blood gene expression and clinical biomarker samples. Both sample types were obtained from participants in the Alzheimer's Disease Neuroimaging Initiative (ADNI) and were analyzed independently with machine learning (ML)-based multiclass classifiers. The study applied novel ML-based data augmentation techniques to the gene expression profile data, which are high-dimensional, low-sample-size (HDLSS) and inherently highly imbalanced. It obtained the highest multiclassification performance to date in the multistage diagnosis of AD from the blood gene expression profiles of ADNI participants. Based on these performance results and other factors, such as early prediction capability, the study compares the efficacy of the two biomarker types for multistage diagnosis. This is the only investigation in which multiclassification-based AD stage diagnosis was conducted using blood gene expression data. We obtained the best multiclassification result in both modalities of the ADNI data in terms of F1 score and were able to identify new genetic biomarkers. Features were selected with a combination of XGBoost and Sequential Floating Backward Selection (SFBS), which reduced 49,386 gene probe sets to the 95 most effective ones. For the clinical study data, the eight most effective biomarkers were selected using SFBS. A deep learning (DL) classifier was used to identify the stages: cognitively normal (CN), mild cognitive impairment (MCI), and Alzheimer's disease (AD)/dementia. DL, support vector machine (SVM), gradient boosting (GB), and random forest (RF) classifiers were used for AD stage detection from the gene expression profile data. Because the genomic data are highly imbalanced, borderline oversampling (data augmentation) was applied during model training, and the original samples were used for validation. With clinical data, the highest ROC AUC scores attained were 0.989, 0.927, and 0.907 for the identification of the CN, MCI, and dementia stages, respectively. The highest F1 scores achieved were 0.971, 0.939, and 0.886. With gene expression data, we obtained ROC AUC scores of 0.763, 0.761, and 0.706 and F1 scores of 0.71, 0.77, and 0.53 for the CN, MCI, and dementia stages, respectively. This represents the best outcome to date for AD stage diagnosis from ADNI blood gene expression profile data using multiclassification techniques. The results indicate that our multiclassification model effectively handles imbalanced HDLSS data and identifies samples of the minority class. MAPK14, PLG, FZD2, FXYD6, and TEP1 are among the novel genes identified as being associated with AD risk.
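
The abstract reports a separate F1 score and ROC AUC for each stage. In a three-class setting such per-stage figures are usually obtained one-vs-rest: each stage in turn is treated as the positive class and the other two as negative. The abstract does not state how the scores were computed, so the following is only a minimal sketch under that assumption. The function names, the stage labels, and the toy labels and scores are hypothetical and are not taken from the study.

```python
# Hypothetical stage labels; "AD" stands for the AD/dementia stage.
STAGES = ["CN", "MCI", "AD"]


def per_class_f1(y_true, y_pred, labels=STAGES):
    """Return {label: F1}, treating each label as the positive class in turn."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def one_vs_rest_auc(y_true, scores, label):
    """ROC AUC for `label` against all other classes.

    Uses the rank interpretation of AUC: the probability that a randomly
    chosen positive sample scores higher than a randomly chosen negative
    one, with ties counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == label]
    neg = [s for t, s in zip(y_true, scores) if t != label]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not ADNI results).
y_true = ["CN", "CN", "MCI", "MCI", "MCI", "AD"]
y_pred = ["CN", "MCI", "MCI", "MCI", "CN", "AD"]
cn_scores = [0.8, 0.4, 0.3, 0.2, 0.6, 0.1]  # hypothetical predicted P(CN) per sample

print(per_class_f1(y_true, y_pred))
print(one_vs_rest_auc(y_true, cn_scores, "CN"))  # 0.875
```

Reporting the scores per stage rather than as a single accuracy matters for imbalanced data: a model can reach high overall accuracy while rarely identifying the minority class, and a per-class F1 score makes that visible.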
