Inria, HeKA, PariSanté Campus, Paris, France.
Centre de Recherche des Cordeliers, Inserm, Université Paris Cité, Sorbonne Université, France.
Stud Health Technol Inform. 2024 Aug 22;316:272-276. doi: 10.3233/SHTI240396.
The task of Named Entity Recognition (NER) is central to leveraging the content of clinical texts in observational studies. Indeed, texts contain a large part of the information available in Electronic Health Records (EHRs). However, clinical texts are highly heterogeneous across healthcare services and institutions, and across countries and languages, making it hard to predict how existing tools will perform on a particular corpus. We compared four NER approaches on three French corpora and share our benchmarking pipeline in an open and easy-to-reuse manner, using the medkit Python library. Our pipelines include fine-tuning operations with either one or several of the considered corpora. Our results illustrate the expected superiority of language models over a dictionary-based approach, and question the necessity of refining models already trained on biomedical texts. Beyond benchmarking, we believe sharing reusable and customizable pipelines for comparing fast-evolving Natural Language Processing (NLP) tools is a valuable contribution, since clinical texts themselves can hardly be shared due to privacy concerns.
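The abstract does not detail the evaluation metric, but NER benchmarks of this kind are typically scored with exact-match entity-level precision, recall, and F1. The following is a minimal, hypothetical sketch of that metric (not the authors' actual medkit pipeline), representing each entity as a (start, end, label) span:

```python
def entity_f1(gold, pred):
    """Exact-match entity-level precision, recall and F1.

    gold, pred: sets of (start, end, label) tuples for one document.
    An entity counts as correct only if both its span and its label
    match an annotation exactly.
    """
    tp = len(gold & pred)  # true positives: spans present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative annotations (hypothetical offsets and labels):
gold = {(0, 7, "DISEASE"), (15, 22, "DRUG")}
pred = {(0, 7, "DISEASE"), (30, 35, "DRUG")}
p, r, f = entity_f1(gold, pred)  # one of two predictions is correct
```

In this toy example one of the two predicted entities matches the gold standard, giving precision, recall, and F1 of 0.5 each; a real benchmark would aggregate such counts over a whole corpus and per entity type.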