Sohrab Saeb, Luca Lonini, Arun Jayaraman, David C. Mohr, Konrad P. Kording
Department of Preventive Medicine, Northwestern University, 10th floor, Rubloff Bldg, 750 N Lake Shore Dr, Chicago, IL 60611, USA.
Department of Physical Medicine and Rehabilitation, Northwestern University, 345 E Superior St, Suite 1479, Chicago, IL 60611, USA.
GigaScience. 2017 May 1;6(5):1-9. doi: 10.1093/gigascience/gix019.
The availability of smartphone and wearable sensor technology is leading to a rapid accumulation of human subject data, and machine learning is emerging as a technique to map those data onto clinical predictions. As machine learning algorithms are increasingly used to support clinical decision making, it is vital to reliably quantify their prediction accuracy. Cross-validation (CV) is the standard approach, in which the accuracy of such algorithms is evaluated on part of the data that the algorithm has not seen during training. However, for this procedure to be meaningful, the relationship between the training and validation sets should mimic the relationship between the training set and the dataset expected in clinical use. Here we compared two popular CV methods: record-wise and subject-wise. While the subject-wise method mirrors the clinically relevant use-case scenario of diagnosis in newly recruited subjects, the record-wise strategy has no such interpretation. Using both a publicly available dataset and a simulation, we found that record-wise CV often massively overestimates the prediction accuracy of the algorithms. We also conducted a systematic review of the relevant literature and found that this overly optimistic method was used by almost half of the retrieved studies that used accelerometers, wearable sensors, or smartphones to predict clinical outcomes. As we move toward an era of machine learning-based diagnosis and treatment, evaluating the accuracy of these algorithms with proper methods is crucial, as inaccurate results can mislead both clinicians and data scientists.