Chaibub Neto Elias, Pratap Abhishek, Perumal Thanneer M, Tummalacherla Meghasyam, Snyder Phil, Bot Brian M, Trister Andrew D, Friend Stephen H, Mangravite Lara, Omberg Larsson
1. Sage Bionetworks, Seattle, USA.
2. Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, USA.
NPJ Digit Med. 2019 Oct 11;2:99. doi: 10.1038/s41746-019-0178-x. eCollection 2019.
Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications, including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repeated measures collected across a set of individuals. However, the inclusion of repeated measurements is not always appropriately taken into account in the analytical evaluations of predictive performance. The assignment of repeated measurements from each individual to both the training and the test sets ("record-wise" data split) is a common practice and can lead to massive underestimation of the prediction error due to the presence of "identity confounding." In essence, these models learn to identify subjects, in addition to diagnostic signal. Here, we present a method that can be used to effectively calculate the amount of identity confounding learned by classifiers developed using a record-wise data split. By applying this method to several real datasets, we demonstrate that identity confounding is a serious issue in digital health studies and that record-wise data splits for machine learning-based applications need to be avoided.
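To make the distinction concrete, the following is a minimal sketch of the two splitting strategies the abstract contrasts. It is not the authors' code and does not implement their method for quantifying identity confounding. The toy dataset, the field name `subject_id`, and the helper functions `record_wise_split`, `subject_wise_split`, and `shared_subjects` are all hypothetical and chosen only for illustration. The sketch shows that a record-wise split lets the same subject appear in both sets, whereas a split by subject does not.

```python
import random


def record_wise_split(records, test_fraction=0.3, seed=0):
    """Shuffle individual records, ignoring which subject produced them."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def subject_wise_split(records, test_fraction=0.3, seed=0):
    """Assign whole subjects to either the training set or the test set."""
    rng = random.Random(seed)
    subjects = sorted({r["subject_id"] for r in records})
    rng.shuffle(subjects)
    n_test = max(1, int(round(len(subjects) * test_fraction)))
    test_subjects = set(subjects[:n_test])
    train = [r for r in records if r["subject_id"] not in test_subjects]
    test = [r for r in records if r["subject_id"] in test_subjects]
    return train, test


def shared_subjects(train, test):
    """Subjects whose records appear in both the training and the test set."""
    return {r["subject_id"] for r in train} & {r["subject_id"] for r in test}


# Hypothetical toy data: 10 subjects with 20 repeated records each.
# The label is constant within a subject, as a diagnosis would be.
records = [
    {"subject_id": s, "label": s % 2, "record": k}
    for s in range(10)
    for k in range(20)
]

train_r, test_r = record_wise_split(records)
train_s, test_s = subject_wise_split(records)

# Record-wise: subjects leak across the split, so a classifier can score well
# on the test set by recognizing individuals rather than disease signal.
print("record-wise shared subjects:", len(shared_subjects(train_r, test_r)))

# Subject-wise: no subject contributes records to both sets.
print("subject-wise shared subjects:", len(shared_subjects(train_s, test_s)))
```

In practice, a library utility that splits by group (for example, a grouped cross-validation splitter) serves the same purpose. The essential design choice is that the unit of randomization is the subject, not the record.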