

Snorkel: rapid training data creation with weak supervision.

Author information

Ratner Alexander, Bach Stephen H, Ehrenberg Henry, Fries Jason, Wu Sen, Ré Christopher

Affiliations

1. Stanford University, Stanford, CA, USA.

2. Computer Science Department, Brown University, Providence, RI, USA.

Publication information

VLDB J. 2020;29(2):709-730. doi: 10.1007/s00778-019-00552-1. Epub 2019 Jul 15.

Abstract

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models faster and achieve higher predictive performance on average than with seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that speeds up each pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel improves predictive performance on average over prior heuristic approaches and comes close, on average, to the predictive performance of large hand-curated training sets.
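To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use the Snorkel library, and every name in it (`lf_contains_causes`, `majority_vote`, the example sentences, the label constants) is hypothetical. Each labeling function encodes one heuristic and may abstain. The votes are then combined by an unweighted majority vote, which is only a simple stand-in baseline: data programming, as described in the abstract, instead estimates the unknown accuracies and correlations of the labeling functions.

```python
from collections import Counter

# Hypothetical label constants; -1 means the labeling function abstains.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_contains_causes(text):
    """Heuristic: the word 'causes' suggests a positive (causal) mention."""
    return POSITIVE if "causes" in text.lower() else ABSTAIN


def lf_contains_negation(text):
    """Heuristic: explicit negation suggests a negative mention."""
    lowered = text.lower()
    if "does not" in lowered or "no evidence" in lowered:
        return NEGATIVE
    return ABSTAIN


def lf_short_sentence(text):
    """Heuristic: very short fragments rarely state a relation."""
    return NEGATIVE if len(text.split()) < 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_causes, lf_contains_negation, lf_short_sentence]


def apply_lfs(texts, lfs=LABELING_FUNCTIONS):
    """Build a label matrix: one row per example, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes):
    """Combine one row of votes, ignoring abstentions; ties yield ABSTAIN.

    This is an unweighted baseline, not the denoising model from the paper.
    """
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


examples = [
    "Drug A causes severe headaches in some patients",
    "There is no evidence that drug B causes nausea",
    "See appendix",
]
label_matrix = apply_lfs(examples)
weak_labels = [majority_vote(row) for row in label_matrix]
```

In the second example two heuristics disagree, so the unweighted vote abstains. Conflicts of this kind are where a model that weighs each labeling function by its estimated accuracy can do better than a plain vote.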


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b72/7075849/69b81ddc13c3/778_2019_552_Fig1_HTML.jpg
