Ratner Alexander, Bach Stephen H, Ehrenberg Henry, Fries Jason, Wu Sen, Ré Christopher
1. Stanford University, Stanford, CA, USA.
2. Computer Science Department, Brown University, Providence, RI, USA.
VLDB J. 2020;29(2):709-730. doi: 10.1007/s00778-019-00552-1. Epub 2019 Jul 15.
Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions, based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models faster and, on average, achieve higher predictive performance than with seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer that automates trade-off decisions and speeds up each pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel improves average predictive performance over prior heuristic approaches and, on average, comes close to the predictive performance of large hand-curated training sets.
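To make the idea of labeling functions concrete, the following is a minimal sketch in plain Python. It does not use the Snorkel library and is not the authors' implementation: the task (flagging sentences that assert a causal relation), the three heuristics, and all function names are hypothetical, chosen only for illustration. The combining step is a simple unweighted majority vote, which is a much cruder stand-in for the denoising step described in the abstract, where the accuracies and correlations of the labeling functions are estimated without ground truth.

```python
from collections import Counter
from typing import Callable, List

# Label conventions assumed for this sketch: a labeling function may abstain.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


# Hypothetical heuristics; each one votes on a sentence or abstains.
def lf_contains_causes(text: str) -> int:
    return POSITIVE if "causes" in text.lower() else ABSTAIN


def lf_contains_negation(text: str) -> int:
    lowered = text.lower()
    if "does not" in lowered or "no evidence" in lowered:
        return NEGATIVE
    return ABSTAIN


def lf_contains_treats(text: str) -> int:
    return NEGATIVE if "treats" in text.lower() else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_contains_causes,
    lf_contains_negation,
    lf_contains_treats,
]


def apply_labeling_functions(texts: List[str]) -> List[List[int]]:
    """Build the label matrix: one row per text, one column per function."""
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]


def majority_vote(votes: List[int]) -> int:
    """Combine one row of votes; abstain when there are no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label_texts(texts: List[str]) -> List[int]:
    """Produce one noisy training label per text."""
    return [majority_vote(row) for row in apply_labeling_functions(texts)]


if __name__ == "__main__":
    examples = [
        "Aspirin causes stomach bleeding",
        "Aspirin treats headaches",
        "The weather was nice",
        "Drug X causes nausea but treats pain",
    ]
    for text, label in zip(examples, label_texts(examples)):
        print(label, text)
```

The last example shows why an unweighted vote is limiting: two heuristics conflict, so the vote abstains. A model that estimates how accurate each labeling function is could instead weight the votes and still emit a probabilistic label for such a sentence.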