A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing.

Affiliation

LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal.

Publication information

Database (Oxford). 2020 Dec 1;2020. doi: 10.1093/database/baaa104.


DOI:10.1093/database/baaa104
PMID:33258966
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7706181/
Abstract

Biomedical relation extraction (RE) datasets are vital for constructing knowledge bases and for enabling the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as relying on domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. Researchers have little power to control who engages in crowdsourcing platforms, how, and in what context. Hence, combining distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers would be asked only to rectify or discard existing annotations, which would make the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype-gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. For Task 2, we also added an extra on-site rater and a domain expert to further assess the crowdsourcing validation quality. Here, we describe a detailed pipeline for RE crowdsourcing validation, create a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with that of the original PGR dataset, as well as combinations of the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset are available at https://github.com/lasigeBioTM/PGR-crowd.
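
The two-task design described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline (their code is in the repository linked above). The function names `split_tasks`, `majority_label` and `f_measure` are hypothetical, and both the random 70/30 split and the majority vote over the seven Task 2 workers are assumptions, since the abstract does not state how the split was drawn or how the workers' answers were aggregated.

```python
import random
from collections import Counter


def split_tasks(relations, task1_fraction=0.7, seed=0):
    """Shuffle the relations and split them into Task 1 and Task 2.

    Assumption: a seeded random split. The abstract only gives the
    proportions (70% for one worker, 30% for seven workers).
    """
    shuffled = list(relations)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * task1_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority_label(votes):
    """Return the label chosen by most workers (assumed aggregation rule).

    With seven votes over two labels a tie cannot occur.
    """
    return Counter(votes).most_common(1)[0][0]


def f_measure(gold, predicted, positive=True):
    """Balanced F-measure: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    task1, task2 = split_tasks(range(10))
    print(len(task1), len(task2))
    # Seven hypothetical worker votes on whether a relation is correct.
    print(majority_label([True, True, False, True, False, True, False]))
```

In a setup like this, the Task 2 majority labels could be compared against the domain expert's labels with `f_measure` to estimate how reliable the single-worker Task 1 annotations are likely to be.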

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ba3/7706181/677a39667d5a/baaa104f6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ba3/7706181/87c670c241b6/baaa104f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ba3/7706181/6c206c8179e9/baaa104f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ba3/7706181/294d965d5a09/baaa104f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ba3/7706181/55c7a3525c6a/baaa104f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ba3/7706181/0038d27b009c/baaa104f5.jpg
