Provision and Characterization of a Corpus for Pharmaceutical, Biomedical Named Entity Recognition for Pharmacovigilance: Evaluation of Language Registers and Training Data Sufficiency.

Affiliations

Bayer AG, Pharmaceuticals, Medical Affairs & Pharmacovigilance, Data Science & Insights, Müllerstr. 170, 13353, Berlin, Germany.

Syncwork AG, Systems Development, Berlin, Germany.

Publication information

Drug Saf. 2023 Aug;46(8):765-779. doi: 10.1007/s40264-023-01322-3. Epub 2023 Jun 20.

Abstract

INTRODUCTION AND OBJECTIVE

Machine learning (ML) systems are widely used for automatic entity recognition in pharmacovigilance. Publicly available datasets do not allow annotated entities to be used independently, and they focus either on small entity subsets or on a single language register (informal or scientific language). The objective of the current study was to create a dataset that enables independent usage of entities, to explore the performance of predictive ML models on different registers, and to introduce a method for investigating entity cut-off performance.

METHODS

We created a dataset that combines different registers and covers 18 different entities. We applied this dataset to compare the performance of integrated models with that of models trained on a single language register only. We introduced fractional stratified k-fold cross-validation to determine model performance at the entity level when models are trained on fractions of the training dataset. We investigated how entity performance develops as the training fraction changes and evaluated entity peak and cut-off performance.
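The abstract does not spell out how fractional stratified k-fold cross-validation is implemented, so the following is only a minimal sketch of one plausible reading: build stratified folds, then, for each fold, train on nested fractions of the remaining data while keeping the test fold fixed. The function names `stratified_folds` and `fractional_splits`, the choice of k = 5, the fractions, and the decision to stratify by language register (rather than by entity type, which is multi-label in NER) are all assumptions made for illustration, not the authors' method.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split record indices into k folds with similar label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def fractional_splits(labels, k, fractions, seed=0):
    """Yield (fold, fraction, train_idx, test_idx) for each fold and fraction."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, seed)
    for fold_no, test_idx in enumerate(folds):
        # Group the training indices by label and shuffle once per fold,
        # so that smaller fractions are nested inside larger ones.
        train_by_label = defaultdict(list)
        for other_no, fold in enumerate(folds):
            if other_no != fold_no:
                for idx in fold:
                    train_by_label[labels[idx]].append(idx)
        for label in sorted(train_by_label):
            rng.shuffle(train_by_label[label])
        for fraction in fractions:
            train_idx = []
            for label in sorted(train_by_label):
                pool = train_by_label[label]
                # Keep the same fraction of every label (at least one record).
                keep = max(1, round(fraction * len(pool)))
                train_idx.extend(pool[:keep])
            yield fold_no, fraction, sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    # Toy labels mirroring the register counts reported in the abstract.
    labels = ["scientific"] * 790 + ["informal"] * 610
    for fold_no, fraction, train_idx, test_idx in fractional_splits(
        labels, k=5, fractions=[0.25, 0.5, 1.0]
    ):
        print(fold_no, fraction, len(train_idx), len(test_idx))
```

In such a setup, a model would be trained on each `train_idx` and scored per entity on the matching `test_idx`; plotting the per-entity score against the fraction would then indicate where performance peaks or levels off.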

RESULTS

The dataset comprises 1400 records (scientific language: 790; informal language: 610) with 2622 sentences and 9989 entity occurrences, drawn from external (801 records) and internal (599 records) sources. We demonstrated that single language register models underperform compared to integrated models trained with multiple language registers.

CONCLUSIONS

A manually annotated dataset with a variety of different pharmaceutical and biomedical entities was created and is made available to the research community. Our results show that models that combine different registers provide better maintainability, have higher robustness, and have similar or higher performance. Fractional stratified k-fold cross-validation allows the evaluation of training data sufficiency on the entity level.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a66d/10345043/9c6c95f43aa1/40264_2023_1322_Fig1_HTML.jpg
