Active Learning of Regular Expressions for Entity Extraction.

Publication information

IEEE Trans Cybern. 2018 Mar;48(3):1067-1080. doi: 10.1109/TCYB.2017.2680466. Epub 2017 Mar 24.

DOI:10.1109/TCYB.2017.2680466
PMID:28358694
Abstract

We consider the automatic synthesis of an entity extractor, in the form of a regular expression, from examples of the desired extractions in an unstructured text stream. This is a long-standing problem for which many different approaches have been proposed, which all require the preliminary construction of a large dataset fully annotated by the user. In this paper, we propose an active learning approach aimed at minimizing the user annotation effort: the user annotates only one desired extraction and then merely answers extraction queries generated by the system. During the learning process, the system digs into the input text for selecting the most appropriate extraction query to be submitted to the user in order to improve the current extractor. We construct candidate solutions with genetic programming (GP) and select queries with a form of querying-by-committee, i.e., based on a measure of disagreement within the best candidate solutions. All the components of our system are carefully tailored to the peculiarities of active learning with GP and of entity extraction from unstructured text. We evaluate our proposal in depth, on a number of challenging datasets and based on a realistic estimate of the user effort involved in answering each single query. The results demonstrate high accuracy with significant savings in terms of computational effort, annotated characters, and execution time over a state-of-the-art baseline.
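The query-selection idea in the abstract (ask the user about the extraction on which the best candidate solutions disagree most) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the disagreement measure or how the committee is built by GP, so the vote-split score, the hand-written committee of regular expressions, and the names `extractions` and `select_query` are all assumptions made for illustration.

```python
import re


def extractions(pattern, text):
    """Return the set of non-empty (start, end) spans that a regex extracts."""
    return {m.span() for m in re.finditer(pattern, text) if m.end() > m.start()}


def select_query(committee, text):
    """Pick the span on which the committee of regexes is most evenly split.

    Assumed disagreement measure (not taken from the paper): 1.0 when exactly
    half of the committee extracts the span, 0.0 when all members agree.
    """
    votes = {}
    for pattern in committee:
        for span in extractions(pattern, text):
            votes[span] = votes.get(span, 0) + 1

    n = len(committee)

    def disagreement(span):
        share = votes[span] / n
        return 1.0 - abs(2.0 * share - 1.0)

    # Sorting first makes tie-breaking deterministic (leftmost span wins).
    return max(sorted(votes), key=disagreement)


# Hypothetical committee: in the paper these candidates would come from GP.
text = "Call 555-1234 before 2018-03-24."
committee = [
    r"\d{3}-\d{4}",
    r"\b\d{3}-\d{4}\b",
    r"\d+(-\d+)+",
    r"[\d-]+",
]

start, end = select_query(committee, text)
print(f"Should '{text[start:end]}' be extracted?")
```

Here all four candidates extract the phone number, so asking about it would teach the system nothing, while only two of them extract the date, so that span is submitted to the user as the next query.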

Similar articles

1
Active Learning of Regular Expressions for Entity Extraction.
IEEE Trans Cybern. 2018 Mar;48(3):1067-1080. doi: 10.1109/TCYB.2017.2680466. Epub 2017 Mar 24.
2
Active learning for ontological event extraction incorporating named entity recognition and unknown word handling.
J Biomed Semantics. 2016 Apr 27;7:22. doi: 10.1186/s13326-016-0059-z. eCollection 2016.
3
Visually defining and querying consistent multi-granular clinical temporal abstractions.
Artif Intell Med. 2012 Feb;54(2):75-101. doi: 10.1016/j.artmed.2011.10.004. Epub 2011 Dec 15.
4
Support patient search on pathology reports with interactive online learning based data extraction.
J Pathol Inform. 2015 Sep 28;6:51. doi: 10.4103/2153-3539.166012. eCollection 2015.
5
Leveraging word embeddings and medical entity extraction for biomedical dataset retrieval using unstructured texts.
Database (Oxford). 2017 Jan 1;2017. doi: 10.1093/database/bax091.
6
BioFed: federated query processing over life sciences linked open data.
J Biomed Semantics. 2017 Mar 15;8(1):13. doi: 10.1186/s13326-017-0118-0.
7
Query-oriented evidence extraction to support evidence-based medicine practice.
J Biomed Inform. 2016 Feb;59:169-84. doi: 10.1016/j.jbi.2015.11.010. Epub 2015 Dec 2.
8
Automatic Search-and-Replace From Examples With Coevolutionary Genetic Programming.
IEEE Trans Cybern. 2021 May;51(5):2612-2624. doi: 10.1109/TCYB.2019.2918337. Epub 2021 Apr 15.
9
A concept-driven biomedical knowledge extraction and visualization framework for conceptualization of text corpora.
J Biomed Inform. 2010 Dec;43(6):1020-35. doi: 10.1016/j.jbi.2010.09.008. Epub 2010 Sep 24.
10
Active Learning by Querying Informative and Representative Examples.
IEEE Trans Pattern Anal Mach Intell. 2014 Oct;36(10):1936-49. doi: 10.1109/TPAMI.2014.2307881.

Cited by

1
RegEMR: a natural language processing system to automatically identify premature ovarian decline from Chinese electronic medical records.
BMC Med Inform Decis Mak. 2023 Jul 18;23(1):126. doi: 10.1186/s12911-023-02239-8.
2
Extraction of Ejection Fraction from Echocardiography Notes for Constructing a Cohort of Patients having Heart Failure with reduced Ejection Fraction (HFrEF).
J Med Syst. 2018 Sep 25;42(11):209. doi: 10.1007/s10916-018-1066-7.