Kartchner David, Nakajima An Davi, Ren Wendi, Zhang Chao, Mitchell Cassie S
Laboratory for Pathology Dynamics, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA.
School of Computer Science, Georgia Institute of Technology, Atlanta, GA 30332, USA.
Artif Intell. 2022 Mar;3(1):211-228. doi: 10.3390/ai3010013. Epub 2022 Mar 16.
A major bottleneck preventing the extension of deep learning systems to new domains is the prohibitive cost of acquiring sufficient training labels. Alternatives such as weak supervision, active learning, and fine-tuning of pretrained models reduce this burden but still require substantial human input, either to select a highly informative subset of instances or to curate labeling functions. REGAL (Rule-Enhanced Generative Active Learning) is an improved framework for weakly supervised text classification that performs active learning over labeling functions rather than over individual instances. REGAL interactively creates high-quality labeling patterns from raw text, enabling a single annotator to accurately label an entire dataset after initialization with three keywords for each class. Experiments demonstrate that REGAL extracts up to three times as many high-accuracy labeling functions from text as current state-of-the-art methods for interactive weak supervision, which dramatically reduces the annotation burden of writing labeling functions. Statistical analysis reveals that REGAL performs as well as or significantly better than interactive weak supervision on five of six commonly used natural language processing (NLP) baseline datasets.
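To make the idea of keyword-seeded labeling functions concrete, the following is a minimal sketch of weak labeling by majority vote, assuming two classes with three seed keywords each. It is not REGAL's implementation: the seed keywords, the names `make_keyword_lf`, `weak_label`, and `SEEDS`, and the use of a simple majority vote are all illustrative assumptions, and the interactive step in which an annotator accepts or rejects proposed rules is not modeled.

```python
import re
from collections import Counter

ABSTAIN = -1  # vote returned when a labeling function does not apply


def make_keyword_lf(keyword, label):
    """Build a labeling function that votes `label` if `keyword` is a token of the text."""
    def lf(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        return label if keyword in tokens else ABSTAIN
    return lf


def weak_label(text, lfs):
    """Majority vote over non-abstaining functions; abstain on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Hypothetical seeds: three keywords per class (0 = sports, 1 = finance).
SEEDS = {
    0: ["goal", "match", "team"],
    1: ["stock", "market", "shares"],
}
LFS = [make_keyword_lf(kw, label) for label, kws in SEEDS.items() for kw in kws]

docs = [
    "The team won the match.",
    "Shares fell as the market closed.",
    "Nothing relevant here.",
]
labels = [weak_label(doc, LFS) for doc in docs]
```

In this sketch, documents that match no keyword remain unlabeled; adding more accurate labeling functions is what raises coverage, which is the part of the workflow the abstract says REGAL assists with.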