IEEE Trans Cybern. 2018 Mar;48(3):1067-1080. doi: 10.1109/TCYB.2017.2680466. Epub 2017 Mar 24.
We consider the automatic synthesis of an entity extractor, in the form of a regular expression, from examples of the desired extractions in an unstructured text stream. Many approaches have been proposed for this long-standing problem, but all of them require the user to first construct a large, fully annotated dataset. In this paper, we propose an active learning approach aimed at minimizing the user annotation effort: the user annotates only one desired extraction and then merely answers extraction queries generated by the system. During the learning process, the system searches the input text to select the most appropriate extraction query to submit to the user in order to improve the current extractor. We construct candidate solutions with genetic programming (GP) and select queries with a form of querying-by-committee, i.e., based on a measure of disagreement among the best candidate solutions. All the components of our system are carefully tailored to the peculiarities of active learning with GP and of entity extraction from unstructured text. We evaluate our proposal in depth on a number of challenging datasets, using a realistic estimate of the user effort involved in answering each query. The results demonstrate high accuracy with significant savings in terms of computational effort, annotated characters, and execution time over a state-of-the-art baseline.
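To make the querying-by-committee idea concrete, here is a minimal Python sketch of the query-selection step only. It is an illustration under stated assumptions, not the method from the paper: the committee is a hand-written list of regular expressions rather than the best individuals of a GP population, the candidate queries are simply the spans extracted by at least one committee member, and the disagreement score (closeness of the vote to an even split) is an assumed stand-in for whatever measure the authors use. The names `extractions` and `select_query` are hypothetical.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the text


def extractions(pattern: str, text: str) -> Set[Span]:
    """Return the set of non-empty spans extracted by one candidate regex."""
    return {m.span() for m in re.finditer(pattern, text) if m.end() > m.start()}


def select_query(committee: List[str], text: str, answered: Set[Span]) -> Optional[Span]:
    """Pick the unanswered span on which the committee is most evenly split.

    Assumption: disagreement is scored as closeness to a 50/50 vote.
    Returns None when no unanswered candidate span remains.
    """
    # Count how many committee members extract each span.
    votes: Dict[Span, int] = {}
    for pattern in committee:
        for span in extractions(pattern, text):
            votes[span] = votes.get(span, 0) + 1

    best_span: Optional[Span] = None
    best_score = -1.0
    for span in sorted(votes):  # sorted so that ties resolve to the earliest span
        if span in answered:
            continue
        agree = votes[span] / len(committee)
        score = 1.0 - abs(2.0 * agree - 1.0)  # 1.0 at an even split, 0.0 when unanimous
        if score > best_score:
            best_span, best_score = span, score
    return best_span
```

In an active learning loop, the span returned by `select_query` would be shown to the user as an extraction query, the answer would be added to the annotated examples, and the candidate extractors would then be evolved further before the next query is chosen. That outer loop and the GP search itself are not shown here.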