Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA.
Harvard Medical School, Boston, MA.
AMIA Annu Symp Proc. 2021 Jan 25;2020:658-667. eCollection 2020.
Information extraction (IE), the distillation of specific information from unstructured data, is a core task in natural language processing. For rare entities (<1% prevalence), collecting the positive examples needed to train a model may demand an infeasibly large sample of mostly negative ones. We combined unsupervised learning with biased positive-unlabeled (PU) learning methods to: 1) facilitate positive example collection while maintaining the assumptions needed to 2) learn a binary classifier from the biased positive-unlabeled data alone. We tested the methods on a real-life use case of rare (<0.42%) entity extraction from medical malpractice documents. When evaluated on a manually reviewed random sample of documents, the PU model achieved an area under the precision-recall curve of 0.283 and an F1 of 0.410, outperforming fully supervised learning (0.022 and 0.096, respectively). The results demonstrate our method's potential to reduce the manual effort required for extracting rare entities from narrative texts.
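The core idea of learning a binary classifier from biased positive-unlabeled data can be illustrated with a minimal sketch. This is not the paper's pipeline; it is a toy implementation of the classic Elkan-Noto PU calibration on synthetic 1-D data, where a "non-traditional" classifier estimating P(labeled | x) is rescaled by the estimated label frequency c = P(labeled | positive) to recover P(positive | x). All data, thresholds, and the simple gradient-descent logistic regression are illustrative assumptions.

```python
# Toy sketch of positive-unlabeled (PU) learning via Elkan-Noto calibration.
# NOT the paper's method: synthetic data, hand-rolled logistic regression.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: rare positives cluster at +2, negatives at -2.
n = 2000
y = rng.random(n) < 0.1                      # true (hidden) labels
x = np.where(y, rng.normal(2.0, 1.0, n), rng.normal(-2.0, 1.0, n))

# PU setting: only a fraction c of true positives carry a label (s = 1);
# everything else is "unlabeled" and treated as negative during training.
c_true = 0.5
s = y & (rng.random(n) < c_true)

def train_logreg(x, t, lr=0.1, steps=2000):
    """Fit P(t=1 | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        g = p - t                             # gradient of log loss
        w -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return w, b

# Step 1: "non-traditional" classifier g(x) ~ P(s=1 | x), labeled vs. unlabeled.
w, b = train_logreg(x, s.astype(float))
g = 1.0 / (1.0 + np.exp(-(w * x + b)))

# Step 2: estimate label frequency c = P(s=1 | y=1) as the mean of g
# over the labeled positives (Elkan & Noto, 2008).
c_hat = g[s].mean()

# Step 3: rescale to recover P(y=1 | x) = P(s=1 | x) / c.
p_y = np.clip(g / c_hat, 0.0, 1.0)
```

In a realistic rare-entity setting the positive prevalence would be far lower and `c_hat` would ideally be estimated on a held-out set of labeled positives rather than in-sample; the sketch only shows the calibration mechanics.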