Inflated prediction accuracy of neuropsychiatric biomarkers caused by data leakage in feature selection.

Affiliations

Department of Electronics and Information, Korea University, 2511, Sejong-ro, Jochiwon-eup, Sejong-si, 30019, Republic of Korea.

Psychiatry Department, Ilsan Paik Hospital, Inje University, Goyang, Republic of Korea.

Publication information

Sci Rep. 2021 Apr 12;11(1):7980. doi: 10.1038/s41598-021-87157-3.

Abstract

In recent years, machine learning techniques have been frequently applied to uncover neuropsychiatric biomarkers, with the aim of accurately diagnosing neuropsychiatric diseases and predicting treatment prognosis. However, many studies either did not perform cross-validation (CV) when using machine learning techniques or performed CV incorrectly, leading to significantly biased results due to overfitting. The aim of this study is to investigate the impact of CV on the prediction performance of neuropsychiatric biomarkers, in particular when feature selection is performed on high-dimensional features. To this end, we evaluated prediction performance using both simulated data and actual electroencephalography (EEG) data. For both the simulated and the actual EEG data, overall prediction accuracy was considerably higher when feature selection was performed outside of CV than when it was performed within CV. The difference between the prediction accuracies of the two approaches can be regarded as the amount of overfitting due to selection bias. Our results indicate the importance of using CV correctly to avoid biased estimates of the prediction performance of neuropsychiatric biomarkers.
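
The contrast between feature selection outside and within CV can be illustrated with a minimal sketch. This is not the authors' code or data: it assumes scikit-learn, pure Gaussian noise features with labels that carry no signal, and hypothetical choices for the sample size, feature count, number of selected features, classifier, and the helper name `compare_selection`. Because the labels are unrelated to the features, any accuracy well above chance in the "outside" setting reflects selection bias rather than a real biomarker.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def compare_selection(n_samples=60, n_features=5000, k=20, seed=0):
    """Return (accuracy with selection outside CV, accuracy with selection within CV).

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Pure noise: the features carry no information about the labels.
    X = rng.standard_normal((n_samples, n_features))
    y = np.repeat([0, 1], n_samples // 2)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Outside CV (leaky): features are chosen using ALL samples, so the
    # held-out folds have already influenced which features are kept.
    X_leaky = SelectKBest(f_classif, k=k).fit_transform(X, y)
    outside = cross_val_score(
        LogisticRegression(max_iter=1000), X_leaky, y, cv=cv
    ).mean()

    # Within CV: the selector is part of the pipeline, so it is refit on the
    # training portion of each fold and never sees the held-out samples.
    pipe = make_pipeline(
        SelectKBest(f_classif, k=k), LogisticRegression(max_iter=1000)
    )
    within = cross_val_score(pipe, X, y, cv=cv).mean()
    return outside, within


if __name__ == "__main__":
    outside, within = compare_selection()
    print(f"selection outside CV: {outside:.2f}")
    print(f"selection within CV:  {within:.2f}")
```

In this sketch, the gap between the two returned accuracies plays the role of the overfitting due to selection bias described in the abstract. Wrapping the selector and classifier in a single pipeline is one common way to keep selection inside each training fold.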


