Snorkel: rapid training data creation with weak supervision.

Authors

Ratner Alexander, Bach Stephen H, Ehrenberg Henry, Fries Jason, Wu Sen, Ré Christopher

Affiliations

1. Stanford University, Stanford, CA, USA.

2. Computer Science Department, Brown University, Providence, RI, USA.

Publication information

VLDB J. 2020;29(2):709-730. doi: 10.1007/s00778-019-00552-1. Epub 2019 Jul 15.

DOI:10.1007/s00778-019-00552-1

PMID:32214778

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7075849/

Abstract

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models 2.8× faster and increase predictive performance by an average of 45.5% versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to a 1.8× speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides a 132% average improvement in predictive performance over prior heuristic approaches and comes within an average of 3.60% of the predictive performance of large hand-curated training sets.
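The idea of labeling functions can be illustrated with a minimal sketch in plain Python. It does not use the Snorkel library: the function names, the example sentences, and the keyword heuristics are all hypothetical, and the majority vote is a deliberately simple stand-in for the denoising step, which in Snorkel estimates the accuracies of the labeling functions rather than counting votes.

```python
from collections import Counter

# Label values; ABSTAIN means the heuristic has no opinion on an example.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_contains_causes(text):
    """Hypothetical heuristic: the word 'causes' suggests an adverse effect."""
    return POSITIVE if "causes" in text.lower() else ABSTAIN

def lf_negation(text):
    """Hypothetical heuristic: an explicit negation suggests no adverse effect."""
    lowered = text.lower()
    return NEGATIVE if "not " in lowered or "no evidence" in lowered else ABSTAIN

def lf_treats(text):
    """Hypothetical heuristic: 'treats' describes a benefit, not an adverse effect."""
    return NEGATIVE if "treats" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_causes, lf_negation, lf_treats]

def apply_lfs(texts, lfs=LABELING_FUNCTIONS):
    """Build the label matrix: one row per example, one column per function."""
    return [[lf(t) for lf in lfs] for t in texts]

def majority_vote(row):
    """Combine one row of votes; abstain on ties or when no function fires."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

texts = [
    "Drug A causes severe headaches.",
    "There is no evidence that drug B causes nausea.",
    "Drug C treats migraine.",
    "Drug D was approved in 2010.",
]
label_matrix = apply_lfs(texts)
labels = [majority_vote(row) for row in label_matrix]
print(labels)  # [1, -1, 0, -1]
```

The second sentence shows why plain voting is not enough: two heuristics disagree, and without knowing which one is more reliable the vote can only abstain. Resolving such conflicts without ground truth is the problem the abstract describes as denoising.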


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b72/7075849/69b81ddc13c3/778_2019_552_Fig1_HTML.jpg
