Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency.

Affiliations

Bayer AG, Pharmaceuticals, Medical Affairs & Pharmacovigilance, Data Science & Insights, Müllerstr. 170, 13353, Berlin, Germany.

Syncwork AG, Systems Development, Berlin, Germany.

Publication information

Drug Saf. 2023 Aug;46(8):765-779. doi: 10.1007/s40264-023-01322-3. Epub 2023 Jun 20.

DOI: 10.1007/s40264-023-01322-3
PMID: 37338799
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10345043/

Abstract

INTRODUCTION AND OBJECTIVE

Machine learning (ML) systems are widely used for automatic entity recognition in pharmacovigilance. Publicly available datasets do not allow the use of annotated entities independently, focusing on small entity subsets or on single language registers (informal or scientific language). The objective of the current study was to create a dataset that enables independent usage of entities, explores the performance of predictive ML models on different registers, and introduces a method to investigate entity cut-off performance.

METHODS

A dataset has been created combining different registers with 18 different entities. We applied this dataset to compare the performance of integrated models with models created with single language registers only. We introduced fractional stratified k-fold cross-validation to determine model performance on entity level by using training dataset fractions. We investigated the course of entity performance with fractions of training datasets and evaluated entity peak and cut-off performance.
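
The abstract does not spell out how the fractional splits are built, so the following is only a minimal sketch of one way to read "fractional stratified k-fold cross-validation": records are split into k label-balanced folds, and for each held-out fold the remaining training records are subsampled at several fractions while the label proportions are kept. The function names, the choice of the language register as the stratification label, the fraction grid, and the independent sampling per fraction are all assumptions made for illustration, not details taken from the paper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign record indices to k folds so each label is spread evenly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def stratified_fraction(indices, labels, fraction, rng):
    """Keep roughly `fraction` of the given indices within each label group."""
    by_label = defaultdict(list)
    for idx in indices:
        by_label[labels[idx]].append(idx)
    kept = []
    for label in sorted(by_label):
        group = by_label[label]
        n_keep = max(1, round(fraction * len(group)))
        kept.extend(rng.sample(group, n_keep))
    return sorted(kept)


def fractional_stratified_kfold(labels, k=5, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Yield (fold_no, fraction, train_indices, test_indices) tuples.

    Assumption: the test fold stays fixed while only the training part shrinks.
    """
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, seed)
    for fold_no, test in enumerate(folds):
        train = [idx for other, fold in enumerate(folds) if other != fold_no for idx in fold]
        for fraction in fractions:
            yield fold_no, fraction, stratified_fraction(train, labels, fraction, rng), sorted(test)


# Placeholder labels: one hypothetical register label per record.
demo_labels = ["scientific"] * 8 + ["informal"] * 4
for fold_no, fraction, train, test in fractional_stratified_kfold(demo_labels, k=2, fractions=(0.5, 1.0)):
    print(fold_no, fraction, len(train), len(test))
```

In such a setup, a model would be trained on each `train_indices` subset and scored per entity type on the matching `test_indices`, giving one performance value per entity, fold, and fraction.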

RESULTS

The dataset combines 1400 records (scientific language: 790; informal language: 610) with 2622 sentences and 9989 entity occurrences and combines data from external (801 records) and internal sources (599 records). We demonstrated that single language register models underperform compared to integrated models trained with multiple language registers.
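
The reported counts can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the paragraph above; the variable names and the derived ratios are illustrative and do not appear in the abstract.

```python
# Counts as reported in the Results paragraph.
records_by_register = {"scientific": 790, "informal": 610}
records_by_source = {"external": 801, "internal": 599}
total_records = 1400
sentences = 2622
entity_occurrences = 9989

# Both breakdowns should add up to the same total number of records.
register_total = sum(records_by_register.values())
source_total = sum(records_by_source.values())

# Derived ratios (not stated in the abstract, computed here for orientation).
sentences_per_record = sentences / total_records
entities_per_sentence = entity_occurrences / sentences
scientific_share = records_by_register["scientific"] / total_records

print(register_total, source_total)
print(round(sentences_per_record, 2), round(entities_per_sentence, 2), round(scientific_share, 3))
```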

CONCLUSIONS

A manually annotated dataset with a variety of different pharmaceutical and biomedical entities was created and is made available to the research community. Our results show that models that combine different registers provide better maintainability, have higher robustness, and have similar or higher performance. Fractional stratified k-fold cross-validation allows the evaluation of training data sufficiency on the entity level.
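
The abstract does not define how peak performance or training data sufficiency is read off the per-fraction results. As a hedged illustration only, the sketch below assumes one plausible reading: for a single entity type, the peak is the best mean score across training fractions, and the data are treated as "sufficient" from the smallest fraction whose mean score lies within a chosen tolerance of that peak. The function name, the tolerance, and the scores are hypothetical; the numbers are toy values, not results from the paper.

```python
from statistics import mean


def summarize_curve(scores_by_fraction, tolerance=0.01):
    """Summarize one entity's learning curve.

    scores_by_fraction maps a training fraction to a list of per-fold scores.
    Returns (peak_score, peak_fraction, sufficient_fraction), where
    sufficient_fraction is the smallest fraction whose mean score is within
    `tolerance` of the peak (an assumed criterion, not the paper's).
    """
    curve = sorted((fraction, mean(scores)) for fraction, scores in scores_by_fraction.items())
    peak_fraction, peak_score = max(curve, key=lambda point: point[1])
    sufficient_fraction = next(f for f, s in curve if s >= peak_score - tolerance)
    return peak_score, peak_fraction, sufficient_fraction


# Toy per-fold scores for one hypothetical entity type (illustrative only).
toy_scores = {
    0.25: [0.60, 0.62],
    0.5: [0.70, 0.72],
    0.75: [0.75, 0.75],
    1.0: [0.75, 0.76],
}
peak_score, peak_fraction, sufficient_fraction = summarize_curve(toy_scores)
print(peak_score, peak_fraction, sufficient_fraction)
```

Under this reading, an entity whose curve is still climbing at the full training set would report a sufficient fraction of 1.0, which would suggest that more annotated examples could still help.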
