Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37232, United States.
Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, United States.
J Am Med Inform Assoc. 2023 Dec 22;31(1):274-280. doi: 10.1093/jamia/ocad178.
The pitfalls of label leakage, the contamination of model input features with outcome information, are well established. Unfortunately, avoiding label leakage in clinical prediction models requires more nuance than the common advice to apply a "no time machine" rule.
We provide a framework for contemplating whether and when model features pose leakage concerns by considering the cadence, perspective, and applicability of predictions. To ground these concepts, we use real-world clinical models to highlight examples of appropriate and inappropriate label leakage in practice.
Finally, we provide recommendations to support clinical and technical stakeholders as they evaluate the leakage tradeoffs associated with model design, development, and implementation decisions. By providing common language and dimensions to consider when designing models, we hope the clinical prediction community will be better prepared to develop statistically valid and clinically useful machine learning models.
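As a point of reference, the "no time machine" rule mentioned above can be sketched as a simple timestamp filter: a feature may be used only if it was recorded before the moment the prediction is made. The sketch below is illustrative only. The function name, the record layout, and the example observations are hypothetical and do not come from the article, and, as the abstract notes, passing this check alone does not guarantee that a feature is free of leakage.

```python
from datetime import datetime

def features_available_at(observations, prediction_time):
    """Keep only observations recorded strictly before prediction_time.

    Hypothetical helper: each observation is assumed to be a dict with
    a "recorded_at" datetime. This implements only the basic
    "no time machine" check, not the fuller framework in the article.
    """
    return [obs for obs in observations if obs["recorded_at"] < prediction_time]

# Hypothetical example data for a single patient encounter.
observations = [
    {"name": "heart_rate", "value": 112,
     "recorded_at": datetime(2023, 1, 1, 8, 0)},
    {"name": "lactate", "value": 3.1,
     "recorded_at": datetime(2023, 1, 1, 9, 30)},
    # Recorded after the prediction time, so it must be excluded.
    {"name": "antibiotic_order", "value": 1,
     "recorded_at": datetime(2023, 1, 1, 11, 0)},
]

prediction_time = datetime(2023, 1, 1, 10, 0)
usable = features_available_at(observations, prediction_time)
usable_names = [obs["name"] for obs in usable]
print(usable_names)
```

A feature that passes this filter can still carry outcome information, for example an order that clinicians place only once they already suspect the outcome, which is why the cadence, perspective, and applicability of predictions also need to be considered.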