Cross-Modal Data Programming Enables Rapid Medical Machine Learning.

Author information

Dunnmon Jared A, Ratner Alexander J, Saab Khaled, Khandwala Nishith, Markert Matthew, Sagreiya Hersh, Goldman Roger, Lee-Messer Christopher, Lungren Matthew P, Rubin Daniel L, Ré Christopher

Affiliation

Department of Computer Science, Stanford University, Stanford, CA, USA.

These authors contributed equally.

Publication information

Patterns (N Y). 2020 May 8;1(2). doi: 10.1016/j.patter.2020.100019. Epub 2020 Apr 28.

Abstract

A major bottleneck in developing clinically impactful machine learning models is a lack of labeled training data for model supervision. Thus, medical researchers increasingly turn to weaker, noisier sources of supervision, such as leveraging extractions from unstructured text reports to supervise image classification. A key challenge in weak supervision is combining sources of information that may differ in quality and have correlated errors. Recently, a statistical theory of weak supervision called data programming has shown promise in addressing this challenge. Data programming now underpins many deployed machine learning systems in the technology industry, even for critical applications. We propose a new technique for applying data programming to the problem of cross-modal weak supervision in medicine, wherein weak labels derived from an auxiliary modality (e.g., text) are used to train models over a different target modality (e.g., images). We evaluate our approach on diverse clinical tasks via direct comparison to institution-scale, hand-labeled datasets. We find that our supervision technique increases model performance by up to 6 points of area under the receiver operating characteristic curve (ROC-AUC) over baseline methods by improving both coverage and quality of the weak labels. Our approach yields models that on average perform within 1.75 points ROC-AUC of those supervised with physician-years of hand labeling and outperform those supervised with physician-months of hand labeling by 10.25 points ROC-AUC, while using only person-days of developer time and clinician work, a time saving of 96%. Our results suggest that modern weak supervision techniques such as data programming may enable more rapid development and deployment of clinically useful machine learning models.
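To make the cross-modal idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each image is paired with a free-text report, and that a few hand-written labeling functions vote on the report. All names here (`lf_no_acute`, `lf_abnormal_terms`, `lf_unremarkable`, `prob_abnormal`, the example reports, and the accuracy values) are hypothetical. The sketch combines votes with a fixed-accuracy, independence-assuming log-odds sum; data programming instead learns source accuracies and accounts for correlated errors, which this simplification does not attempt.

```python
import math
import re

ABSTAIN, NORMAL, ABNORMAL = -1, 0, 1

# Hypothetical labeling functions over the auxiliary modality (report text).
def lf_no_acute(report: str) -> int:
    return NORMAL if re.search(r"\bno acute\b", report, re.I) else ABSTAIN

def lf_abnormal_terms(report: str) -> int:
    found = re.search(r"\b(fracture|effusion|mass)\b", report, re.I)
    return ABNORMAL if found else ABSTAIN

def lf_unremarkable(report: str) -> int:
    return NORMAL if re.search(r"\bunremarkable\b", report, re.I) else ABSTAIN

# Assumed (not learned) accuracy for each labeling function.
LFS = [(lf_no_acute, 0.85), (lf_abnormal_terms, 0.80), (lf_unremarkable, 0.75)]

def prob_abnormal(report: str):
    """Combine votes into P(abnormal); return None if every source abstains.

    Uses a uniform prior and treats sources as independent, so each
    non-abstaining vote adds or subtracts log(acc / (1 - acc)).
    """
    log_odds = 0.0
    voted = False
    for lf, acc in LFS:
        vote = lf(report)
        if vote == ABSTAIN:
            continue
        voted = True
        weight = math.log(acc / (1.0 - acc))
        log_odds += weight if vote == ABNORMAL else -weight
    if not voted:
        return None
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical (image id, report text) pairs.
pairs = [
    ("img_001", "No acute cardiopulmonary abnormality. Lungs unremarkable."),
    ("img_002", "Small left pleural effusion."),
    ("img_003", "Comparison with prior study."),
]

# Probabilistic labels keyed by image id; uncovered images map to None.
weak_labels = {image_id: prob_abnormal(text) for image_id, text in pairs}
```

The resulting probabilistic labels would serve as training targets for a model over the target modality (the images), with uncovered images such as `img_003` left out of training. Coverage and label quality, the two quantities the abstract credits for the performance gain, correspond here to how many images receive a label and how close those labels are to the truth.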

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e951/7660379/9ef3df15e117/gr1.jpg
