Presagen, Adelaide, SA, 5000, Australia.
School of Mathematical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia.
Sci Rep. 2021 Sep 9;11(1):18005. doi: 10.1038/s41598-021-97341-0.
The detection and removal of poor-quality data in a training set is crucial for achieving high-performing AI models. In healthcare, data can be inherently poor-quality due to uncertainty or subjectivity, but, as is often the case, data-privacy requirements restrict AI practitioners from accessing the raw training data, meaning that manual visual verification of private patient data is not possible. Here we describe a novel method for the automated identification of poor-quality data, called Untrainable Data Cleansing. This method is shown to have numerous benefits, including protection of private patient data; improvement in AI generalizability; and reduction in the time, cost, and data needed for training, all while offering a truer report of AI performance itself. Additionally, results show that Untrainable Data Cleansing could be useful as a triage tool to identify difficult clinical cases that may warrant in-depth evaluation or additional testing to support a diagnosis.
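The abstract names the method (Untrainable Data Cleansing) but does not spell out its mechanism. As a minimal, hedged sketch of how automated flagging of poor-quality training samples can work in general, the example below repeatedly trains a classifier under cross-validation and flags samples that are consistently misclassified when held out, so that every sample is judged only by models that never saw it. The choice of classifier (logistic regression), the repeat count, and the 0.8 error-rate threshold are illustrative assumptions, not the authors' published configuration.

```python
# Hedged sketch: flag training samples that are consistently misclassified
# across repeated cross-validation. This illustrates the general idea of
# automated poor-quality-data detection; it is not the published
# Untrainable Data Cleansing algorithm, whose details the abstract omits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def flag_untrainable(X, y, n_repeats=10, n_splits=5, threshold=0.8):
    """Return indices of samples misclassified in >= `threshold` of out-of-fold predictions."""
    misclassified = np.zeros(len(y))
    for repeat in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=repeat)
        for train_idx, val_idx in skf.split(X, y):
            clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
            preds = clf.predict(X[val_idx])
            misclassified[val_idx] += (preds != y[val_idx]).astype(float)
    # Each sample is held out exactly once per repeat, so this is a per-sample
    # out-of-fold error rate over n_repeats independent fold assignments.
    error_rate = misclassified / n_repeats
    return np.where(error_rate >= threshold)[0], error_rate


if __name__ == "__main__":
    # Synthetic stand-in for a clinical dataset, with 5% of labels flipped
    # to simulate inherently poor-quality (noisy or subjective) annotations.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    rng = np.random.default_rng(0)
    noisy = rng.choice(len(y), size=25, replace=False)
    y[noisy] = 1 - y[noisy]

    flagged, rates = flag_untrainable(X, y)
    print(f"Flagged {len(flagged)} samples; "
          f"{len(set(flagged) & set(noisy))} overlap with the injected noisy labels.")
```

A design note on this sketch: relying on out-of-fold predictions, rather than training-set fit, is what allows the flagging to run without any manual inspection of the underlying records, which is consistent with the privacy motivation stated in the abstract.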