Department of Pathology and Clinical Bioinformatics, Erasmus Medical Center (EMC), Wytemaweg, 3015 CN, Rotterdam, The Netherlands.
Department of Neurology, Leiden University Medical Center (LUMC), PO Box 9600, 2300 RC, Leiden, The Netherlands.
Orphanet J Rare Dis. 2023 Jul 27;18(1):218. doi: 10.1186/s13023-023-02785-4.
In biomedicine, machine learning (ML) has proven beneficial for the prognosis and diagnosis of different diseases, including cancer and neurodegenerative disorders. For rare diseases, however, the need for large datasets often rules out this approach. Huntington's disease (HD) is a rare neurodegenerative disorder caused by a CAG repeat expansion in the coding region of the huntingtin gene. Enroll-HD, the world's largest observational study for HD, includes over 21,000 participants, which makes it amenable to ML methods. In this study, we pre-processed and imputed Enroll-HD with ML methods to maximise the inclusion of participants and variables. With this dataset we developed models to improve the prediction of the age at onset (AAO) and compared them with the well-established Langbehn formula. In addition, we used recurrent neural networks (RNNs) to demonstrate the utility of ML methods for longitudinal datasets, predicting driving capability by learning from previous participant assessments.
Simple pre-processing imputed around 42% of the missing values in Enroll-HD, and ML-based imputation allowed 167 variables to be retained. Multiple ML models outperformed the Langbehn formula. The best ML model (light gradient boosting machine) improved AAO prediction over the Langbehn formula by 9.2%, based on root mean squared error in the test set. In addition, our ML model provides a more accurate prognosis over a wider CAG repeat range than the Langbehn formula. Driving capability was predicted with an accuracy of 85.2%. The resulting pre-processing workflow and the code to train the ML models are available for related HD predictions at: https://github.com/JasperO98/hdml/tree/main
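As a reading aid for the reported 9.2% figure, the sketch below shows one common way to express a model's gain over a baseline as a relative reduction in root mean squared error. The abstract does not state the exact calculation, so this definition is an assumption. The function names and the toy numbers are hypothetical and are not taken from Enroll-HD or from the authors' repository.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_rmse_improvement(y_true, baseline_pred, model_pred):
    """Percentage by which the model's RMSE is lower than the baseline's.

    Assumed definition: 100 * (RMSE_baseline - RMSE_model) / RMSE_baseline.
    """
    baseline_rmse = rmse(y_true, baseline_pred)
    model_rmse = rmse(y_true, model_pred)
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# Toy ages at onset (years); purely illustrative, not study data.
observed_aao = [40.0, 50.0, 60.0, 35.0]
baseline_pred = [44.0, 46.0, 64.0, 39.0]  # stand-in for a formula-based baseline
model_pred = [42.0, 48.0, 62.0, 37.0]     # stand-in for an ML model

improvement = relative_rmse_improvement(observed_aao, baseline_pred, model_pred)
print(f"Relative RMSE improvement: {improvement:.1f}%")
```

Under this definition, a positive value means the model has a lower test-set error than the baseline, and a negative value means it performs worse.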
Our pre-processing workflow made it possible to resolve the missing values and include most participants and variables in Enroll-HD. We show the added value of an ML approach, which improved AAO predictions and allowed for the development of an advisory model that can assist clinicians and participants in estimating future driving capability.