Using distant supervision to augment manually annotated data for relation extraction.

Affiliations

Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America.

Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America.

Publication information

PLoS One. 2019 Jul 30;14(7):e0216913. doi: 10.1371/journal.pone.0216913. eCollection 2019.

DOI:10.1371/journal.pone.0216913

PMID:31361753

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6667146/

Abstract

Significant progress has recently been made in applying deep learning to natural language processing tasks. However, deep learning models typically require large amounts of annotated training data, while only small labeled datasets are available for many natural language processing tasks in the biomedical literature. Building large datasets for deep learning is expensive because it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained using distant supervision. Because data obtained by distant supervision is often noisy, we first apply heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
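The distant-supervision step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the knowledge base `KNOWN_PAIRS`, the entity names `ProtA`/`ProtB`/`ProtC`, the `Candidate` type, and the token-distance filter with its `max_gap` threshold are all assumptions introduced here, since the abstract does not say which heuristics were used.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    tokens: Tuple[str, ...]  # tokenized sentence
    e1: str                  # first entity mention
    e2: str                  # second entity mention


# Hypothetical knowledge base of entity pairs assumed to be related.
KNOWN_PAIRS = {frozenset({"ProtA", "ProtB"})}


def distant_label(c: Candidate) -> int:
    """Return 1 if the entity pair is in the knowledge base, else 0."""
    return int(frozenset({c.e1, c.e2}) in KNOWN_PAIRS)


def keep(c: Candidate, max_gap: int = 10) -> bool:
    """Illustrative noise filter (an assumption, not the paper's heuristic):
    both mentions must occur in the sentence and lie within max_gap tokens."""
    if c.e1 not in c.tokens or c.e2 not in c.tokens:
        return False
    gap = abs(c.tokens.index(c.e1) - c.tokens.index(c.e2))
    return gap <= max_gap


def build_distant_set(cands: List[Candidate], max_gap: int = 10):
    """Label every candidate that survives the filter."""
    return [(c, distant_label(c)) for c in cands if keep(c, max_gap)]


# Toy usage: one close pair, one pair too far apart, one unknown pair.
near = Candidate(("ProtA", "binds", "ProtB"), "ProtA", "ProtB")
far = Candidate(("ProtA",) + ("x",) * 20 + ("ProtB",), "ProtA", "ProtB")
other = Candidate(("ProtA", "and", "ProtC"), "ProtA", "ProtC")
distant = build_distant_set([near, far, other])
```

In a setup like the one the abstract outlines, a model would presumably be trained first on the filtered distant set and then further trained on the smaller manually annotated set. That training order is one common reading of "methods inspired by transfer learning" and is likewise an assumption here.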
