Ivarsson Orrelid Christoffer, Rosberg Oscar, Weiner Sophia, Johansson Fredrik D, Gobom Johan, Zetterberg Henrik, Mwai Newton, Stempfle Lena
Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Rännvägen 6b, 41296, Gothenburg, Västra Götalandsregionen, Sweden.
Department of Psychiatry and Neurochemistry, The Sahlgrenska Academy at the University of Gothenburg, Wallinsgatan 6, 43141, Mölndal, Västra Götalandsregionen, Sweden.
Fluids Barriers CNS. 2025 Mar 3;22(1):23. doi: 10.1186/s12987-025-00634-z.
This study explores the application of machine learning to high-dimensional proteomics datasets for identifying Alzheimer's disease (AD) biomarkers. AD, a neurodegenerative disorder affecting millions worldwide, necessitates early and accurate diagnosis for effective management.
We leverage Tandem Mass Tag (TMT) proteomics data from cerebrospinal fluid (CSF) samples from the frontal cortex of patients with idiopathic normal pressure hydrocephalus (iNPH), a condition often comorbid with AD, with rare access to both lumbar and ventricular samples. Our methodology includes extensive data preprocessing to address batch effects and missing values, followed by the Synthetic Minority Over-sampling Technique (SMOTE) for data augmentation to overcome the small sample size. We apply linear and non-linear machine learning models, as well as ensemble methods, to compare iNPH patients with and without biomarker evidence of AD pathology in a classification task.
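The oversampling step can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is not the authors' implementation. The function name `smote`, its parameters, and the toy abundance values are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic samples interpolated between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [s for j, s in enumerate(minority) if j != i]
        others.sort(key=lambda s: math.dist(base, s))
        neighbour = rng.choice(others[:k])
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy protein-abundance vectors (hypothetical values, not study data).
minority = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.2], [1.1, 2.1]]
new_samples = smote(minority, n_new=6)
```

As a general practice, oversampling of this kind is usually applied to the training folds only, so that synthetic samples do not leak into the evaluation data.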
We present a machine learning workflow for working with high-dimensional TMT proteomics data that addresses their inherent data characteristics. Our results demonstrate that batch effect correction has little or no impact on the models' performance, and that robust feature selection is critical for model stability and performance, especially in the high-dimensional proteomics setting for AD diagnostics. The results further indicate that removing features with missing values produces stronger models than imputing them. Our best-performing disease-progression detection model, a random forest, achieves an AUC of 0.84 (± 0.03).
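The two missing-value strategies compared above can be sketched as follows. The abstract does not say which imputation method was used, so per-feature mean imputation is an assumption here, as are the function names and the toy matrix.

```python
import math


def drop_incomplete_features(rows):
    """Keep only the columns that are observed (not NaN) in every sample."""
    n_features = len(rows[0])
    keep = [j for j in range(n_features)
            if not any(math.isnan(r[j]) for r in rows)]
    return [[r[j] for j in keep] for r in rows], keep


def impute_feature_means(rows):
    """Replace each NaN with the mean of the observed values in its column."""
    n_features = len(rows[0])
    means = []
    for j in range(n_features):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        # Fall back to 0.0 for a column with no observed values (assumption).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]


nan = float("nan")
# Hypothetical abundance matrix: 3 samples x 3 proteins (not study data).
X = [[1.0, nan, 3.0],
     [2.0, 5.0, 4.0],
     [3.0, 7.0, 5.0]]

X_dropped, kept_columns = drop_incomplete_features(X)
X_imputed = impute_feature_means(X)
```

Dropping shrinks the feature space but keeps only measured values. Imputing keeps every protein at the cost of introducing estimated entries.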
We identify several novel protein biomarker candidates, such as FABP3 and GOT1, with potential diagnostic value for detecting AD pathology. These findings suggest that different biomarkers may be needed for AD diagnosis in patients with iNPH, and that different biomarkers should be considered for ventricular and lumbar CSF samples. This work underscores the importance of a meticulous machine learning process in enhancing biomarker discovery. Our study also provides insights into translating biomarkers from other central nervous system diseases such as iNPH, and into using both ventricular and lumbar CSF samples for biomarker discovery, providing a foundation for future research and clinical applications.