Department of Languages and Computer Systems. IXA Research Group: http://ixa.eus. University of the Basque Country (UPV-EHU), Leioa, Spain.
Int J Med Inform. 2019 Sep;129:49-59. doi: 10.1016/j.ijmedinf.2019.05.015. Epub 2019 May 22.
Automatic extraction of the diseases or morbid conditions recorded in death certificates is a critical process, useful for billing, epidemiological studies and comparison across countries. Because these clinical documents are written in ordinary natural language, automatic coding is difficult: spontaneous terms often diverge strongly from standard reference terminologies such as the International Classification of Diseases (ICD).
Our aim is to propose a general, multilingual approach that maps Diagnostic Terms onto the standard framework provided by the ICD. We evaluated our proposal on a set of clinical texts written in French, Hungarian and Italian.
ICD-10 encoding is a multi-class classification problem with a very large number of classes (in the thousands). After considering several approaches, we framed the problem as a sequence-to-sequence task. In line with current trends, we opted for neural networks. We tested different types of neural architectures on three datasets in which Diagnostic Terms (DTs) are paired with their ICD-10 codes.
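As a rough illustration of the sequence-to-sequence framing, the sketch below turns a diagnostic term and its codes into a source token sequence and a target code sequence, and builds the two vocabularies a neural encoder-decoder would need. The function names, the whitespace tokenization, the boundary symbols and the two example terms are assumptions made for illustration; they are not taken from the paper or its datasets.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical boundary symbols for the target (code) sequence.
BOS, EOS = "<s>", "</s>"


def to_seq2seq_pair(term: str, codes: List[str]) -> Tuple[List[str], List[str]]:
    """Map one diagnostic term and its ICD-10 codes to (source, target) sequences.

    Assumption: simple lowercasing and whitespace tokenization; the paper's
    actual preprocessing is not described in the abstract.
    """
    source = term.lower().split()
    target = [BOS] + list(codes) + [EOS]
    return source, target


def build_vocab(sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for token in seq:
            vocab.setdefault(token, len(vocab))
    return vocab


# Invented examples (not from the evaluation data); codes are written without the dot.
examples = [
    ("Infarctus du myocarde", ["I219"]),
    ("Diabète de type 2 pneumonie", ["E119", "J189"]),
]

pairs = [to_seq2seq_pair(term, codes) for term, codes in examples]
src_vocab = build_vocab(src for src, _ in pairs)
tgt_vocab = build_vocab(tgt for _, tgt in pairs)
```

Under this framing one input line can yield several codes, because the decoder emits codes one at a time until it produces the end symbol, instead of picking a single class out of thousands.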
Our results set a new state of the art in multilingual ICD-10 coding, outperforming several alternative approaches and showing the feasibility of automatic ICD-10 prediction, with F-measures of 0.838, 0.963 and 0.952 for French, Hungarian and Italian, respectively. Additionally, the results are interpretable: the model can show the alignments between the original text and each output code, providing experts with supporting evidence when confronted with coding decisions.
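For readers unfamiliar with the metric, the following minimal sketch computes a micro-averaged F-measure over predicted and reference code sets. The abstract does not state the exact evaluation protocol, so the micro-averaging, the per-line set comparison and the function name `micro_f_measure` are assumptions; the toy inputs are invented and unrelated to the reported scores.

```python
from typing import List, Sequence, Tuple


def micro_f_measure(
    gold: Sequence[List[str]], predicted: Sequence[List[str]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure), pooling counts over all lines.

    Assumption: codes are compared as unordered sets per input line.
    """
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, predicted):
        g, p = set(gold_codes), set(pred_codes)
        tp += len(g & p)  # codes predicted and present in the reference
        fp += len(p - g)  # codes predicted but absent from the reference
        fn += len(g - p)  # reference codes the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented toy data: two lines, one wrong code on the second line.
gold = [["I219"], ["E119", "J189"]]
predicted = [["I219"], ["E119", "I10"]]
precision, recall, f_measure = micro_f_measure(gold, predicted)
```

Here two of the three predicted codes are correct and two of the three reference codes are recovered, so precision, recall and F-measure all equal 2/3.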