Neural Translation and Automated Recognition of ICD-10 Medical Entities From Natural Language: Model Development and Performance Assessment.

Authors

Falissard Louis, Morgand Claire, Ghosn Walid, Imbaud Claire, Bounebache Karim, Rey Grégoire

Affiliation

Centre for Epidemiology on Medical Causes of Death, Inserm, Le Kremlin Bicêtre, France.

Publication

JMIR Med Inform. 2022 Apr 11;10(4):e26353. doi: 10.2196/26353.

Abstract

BACKGROUND

The recognition of medical entities from natural language is a ubiquitous problem in the medical field, with applications ranging from medical coding to the analysis of electronic health data for public health. It is, however, a complex task that usually requires human expert intervention, which makes it expensive and time-consuming. Recent advances in artificial intelligence, specifically the rise of deep learning methods, have enabled computers to make efficient decisions on a number of complex problems, a notable example being neural sequence models and their powerful applications in natural language processing. These models, however, require a considerable amount of training data, which is typically their main limiting factor. The Centre for Epidemiology on Medical Causes of Death (CépiDc) stores an exhaustive database of death certificates at the French national scale, amounting to several million natural language examples, each paired with its human-coded medical entities and available to the machine learning practitioner.

OBJECTIVE

The aim of this paper was to investigate the application of deep neural sequence models to the problem of medical entity recognition from natural language.

METHODS

The investigated data set included every French death certificate from 2011 to 2016. These certificates contain information such as the subject's age, the subject's gender, and the chain of events leading to death, both in French and encoded as International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) medical entities, for a total of around 3 million observations in the data set. The task of automatically recognizing ICD-10 medical entities from the French natural language description of the chain of events leading to death was formulated as a sequence-to-sequence modeling problem, a type of predictive modeling problem. A deep neural network-based model known as the Transformer was slightly adapted and fit to the data set. Its performance was then assessed on an external data set and compared with the current state-of-the-art approach. CIs for derived measurements were estimated via bootstrapping.
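
As an illustration of the last step, the sketch below shows one common way to bootstrap a CI for an F-measure. It is not the authors' code. It assumes a micro-averaged F-measure over the ICD-10 codes of each certificate and a percentile bootstrap that resamples certificates with replacement; the abstract specifies neither. The function names and the toy code lists are hypothetical.

```python
import random
from collections import Counter


def f_measure(pairs):
    """Micro-averaged F-measure over (predicted_codes, gold_codes) pairs."""
    tp = n_pred = n_gold = 0
    for pred, gold in pairs:
        p, g = Counter(pred), Counter(gold)
        tp += sum((p & g).values())   # codes present in both lists
        n_pred += sum(p.values())     # all predicted codes
        n_gold += sum(g.values())     # all human-coded codes
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample certificates with replacement."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        f_measure([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical toy data: (model output, human coding) for three certificates.
pairs = [
    (["I21.9", "I10"], ["I21.9", "I10"]),
    (["C34.9"], ["C34.9", "J18.9"]),
    (["E11.9"], ["E14.9"]),
]
point_estimate = f_measure(pairs)
lower, upper = bootstrap_ci(pairs, n_resamples=200)
print(point_estimate, lower, upper)
```

Resampling whole certificates rather than individual codes keeps the codes of one certificate together, which is a design choice of this sketch and not something the abstract states.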

RESULTS

The proposed approach resulted in an F-measure value of 0.952 (95% CI 0.946-0.957), which constitutes a significant improvement over the current state-of-the-art approach and its previously reported F-measure value of 0.825 as assessed on a comparable data set. Such an improvement makes possible a whole field of new applications, from nosologist-level automated coding to temporal harmonization of death statistics.

CONCLUSIONS

This paper shows that a deep artificial neural network can directly learn from voluminous data sets in order to identify complex relationships between natural language and medical entities, without any explicit prior knowledge. Although not entirely free from mistakes, the derived model constitutes a powerful tool for automated coding of medical entities from medical language with promising potential applications.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2938/9039820/b053423b220e/medinform_v10i4e26353_fig1.jpg
