Allgaier Johannes, Pryss Rüdiger
Institute of Clinical Epidemiology and Biometry, Julius-Maximilians-University Würzburg, Josef-Schneider-Straße 2, Würzburg, Germany.
Commun Med (Lond). 2024 Apr 22;4(1):76. doi: 10.1038/s43856-024-00468-0.
Machine learning (ML) models are evaluated on a test set to estimate their performance after deployment. The design of the test set is therefore important: if the data distribution after deployment differs too much from the test distribution, model performance falls below the estimate. At the same time, the data often contain undetected groups. For example, multiple assessments from one user may constitute a group, which is usually the case in mHealth scenarios.
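To illustrate the grouping problem, the following minimal sketch (not the authors' code; the data and group structure are made up) shows how a naive split scatters one user's assessments across train and test, whereas a group-aware split keeps each user entirely on one side:

import numpy as np
from sklearn.model_selection import KFold, GroupKFold

rng = np.random.default_rng(0)
users = np.repeat(np.arange(20), 10)   # 20 users x 10 assessments; group = user ID
X = rng.normal(size=(users.size, 5))   # toy EMA feature vectors

# Naive split: groups are ignored, so one user's assessments can land
# in both train and test, leaking user-specific signal.
train_idx, test_idx = next(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
print(len(set(users[train_idx]) & set(users[test_idx])))   # > 0: users overlap

# Group-aware split: every user is entirely in train or entirely in test.
train_idx, test_idx = next(GroupKFold(n_splits=5).split(X, groups=users))
print(len(set(users[train_idx]) & set(users[test_idx])))   # 0: no overlap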
In this work, we evaluate model performance using several cross-validation train-test split approaches, in some cases deliberately ignoring the groups. By additionally sorting the groups (in our case: users) by time, we simulate a concept drift scenario for better external validity. For this evaluation, we use 7 longitudinal mHealth datasets, all containing Ecological Momentary Assessments (EMA). Furthermore, we compare model performance with baseline heuristics, questioning whether a complex ML model is needed at all.
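A time-sorted, user-level split of this kind could be sketched as follows; the column names ("user_id", "timestamp") are assumptions for illustration, not the study's actual pipeline:

import pandas as pd

def time_sorted_user_split(df: pd.DataFrame, test_fraction: float = 0.2):
    """User-level split ordered by each user's first assessment time:
    train on the earliest users, test on those who joined the study later."""
    first_seen = df.groupby("user_id")["timestamp"].min().sort_values()
    n_test = max(1, int(len(first_seen) * test_fraction))
    test_users = set(first_seen.index[-n_test:])   # the most recent users
    is_test = df["user_id"].isin(test_users)
    return df[~is_test], df[is_test]

Testing on strictly "newer" users mimics deployment, where the model encounters users whose data were generated after training.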
Hidden groups in the dataset lead to an overestimation of ML performance after deployment. For prediction, a user's last completed questionnaire is a reasonable heuristic for their next response and potentially outperforms a complex ML model. Because we included 7 studies, low variance appears to be a fundamental phenomenon of mHealth datasets rather than an artifact of any single study.
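A sketch of such a "last completed questionnaire" baseline, again under assumed column names ("user_id", "timestamp", "target"):

import pandas as pd

def last_response_baseline(df: pd.DataFrame) -> pd.Series:
    """Predict each assessment's target as the same user's previous
    target value; a user's first assessment gets NaN."""
    ordered = df.sort_values(["user_id", "timestamp"])
    return ordered.groupby("user_id")["target"].shift(1)

# Example comparison against a model, e.g. via mean absolute error:
# mae_baseline = (df["target"] - last_response_baseline(df)).abs().mean()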
The way EMA generates mHealth data raises questions about the user versus the assessment level and about the appropriate validation of ML models. Our analysis shows that further research is needed to obtain robust ML models. In addition, simple heuristics can be considered as an alternative to ML. Domain experts should be consulted to identify potentially hidden groups in the data.