Feature selection and machine learning methods for optimal identification and prediction of subtypes in Parkinson's disease.

Author information

Department of Energy Engineering and Physics, Amirkabir University of Technology, Tehran, Iran; Department of Physics & Astronomy, University of British Columbia, Vancouver, BC, Canada.

Department of Energy Engineering and Physics, Amirkabir University of Technology, Tehran, Iran.

Publication information

Comput Methods Programs Biomed. 2021 Jul;206:106131. doi: 10.1016/j.cmpb.2021.106131. Epub 2021 Apr 29.

DOI:10.1016/j.cmpb.2021.106131

PMID:34015757

Abstract

OBJECTIVES

The present work focuses on assessment of Parkinson's disease (PD), including both PD subtype identification (unsupervised task) and prediction (supervised task). We specifically investigate optimal feature selection and machine learning algorithms for these tasks.

METHODS

We selected 885 PD subjects from longitudinal datasets (years 0-4; Parkinson's Progression Markers Initiative) and investigated 981 features, including motor, non-motor, and imaging features (SPECT-based radiomics features extracted using our standardized SERA software). Two different hybrid machine learning systems (HMLSs) were constructed and applied to the data to select optimal combinations for both tasks: (i) identification of PD subtypes (unsupervised clustering), and (ii) prediction of these subtypes in year 4 (supervised classification). From the original data for years 0 (baseline) and 1, we created new datasets as inputs to the prediction task: (i, ii) CSD0 and CSD01, cross-sectional datasets from year 0 only and from both years 0 and 1, respectively; and (iii) TD01, a timeless dataset from both years 0 and 1. The PD subtype in year 4 was taken as the outcome. Finally, high-score features were derived via ensemble voting based on their prioritizations by the feature selector algorithms (FSAs).
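The ensemble voting step can be illustrated with a minimal sketch. The abstract does not specify the voting scheme, so the Borda-style rank aggregation below is an assumption, as are the function name `ensemble_vote` and the feature names, which are purely hypothetical and not taken from the study.

```python
from collections import defaultdict


def ensemble_vote(rankings, top_k):
    """Aggregate per-selector feature rankings with a Borda-style count.

    rankings: list of lists, each ordering feature names from most to
              least important according to one feature selector.
    Returns the top_k feature names with the highest total score.
    NOTE: Borda counting is an assumed scheme, not the paper's stated method.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            # The best-ranked feature earns n points, the worst earns 1.
            scores[feature] += n - position
    # Sort by descending score; break ties alphabetically for determinism.
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return ordered[:top_k]


# Hypothetical rankings from three feature selectors (illustrative names only).
rankings = [
    ["updrs_iii", "caudate_uptake", "moca", "age"],
    ["caudate_uptake", "updrs_iii", "age", "moca"],
    ["caudate_uptake", "moca", "updrs_iii", "age"],
]
top_features = ensemble_vote(rankings, top_k=2)
print(top_features)
```

Under this scheme a feature scores highly only when several selectors agree on it, which is one plausible reading of "high-score features".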

RESULTS

In the clustering task, the optimal combinations were selected by individual FSAs, which reduced the 981 features to 547 and enabled high correlation compared with using all features. In the prediction task, we were able to select optimal combinations resulting in an accuracy >90% only for the timeless dataset (TD01); there, the optimal combination used 77 features selected directly by FSAs. In both tasks, however, using only the high-score features from ensemble voting did not enable acceptable performance, showing optimal feature selection via individual FSAs to be more effective.

CONCLUSION

Combining non-imaging information with SPECT-based radiomics features, together with optimal utilization of HMLSs, can enable robust identification of subtypes as well as appropriate prediction of these subtypes in PD patients. Moreover, use of the timeless dataset, beyond cross-sectional datasets, enabled predictive accuracies over 90%. Overall, we showed that radiomics features extracted from SPECT images are important in clustering as well as prediction of PD subtypes.
