Institut Pierre Louis d'Epidémiologie et de Santé Publique, Sorbonne Université, Inserm, 27 rue Chaligny, 75012 Paris, France; Département de médecine interne, APHP. Sorbonne Université, France.
Sorbonne Université, Inserm, Université Sorbonne Paris Nord, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances pour la e-Santé (LIMICS), 75006 Paris, France.
Artif Intell Med. 2022 Jun;128:102311. doi: 10.1016/j.artmed.2022.102311. Epub 2022 Apr 26.
The development of electronic health records has provided a large volume of unstructured biomedical information. Extracting patient characteristics from these data has become a major challenge, especially in languages other than English.
Inspired by the French Text Mining Challenge (DEFT 2021) [1], in which we participated, our study proposes a multilabel classification of clinical narratives that automatically extracts the main features of a patient report. Our system is an end-to-end pipeline from raw text to labels with two main steps: named entity recognition and multilabel classification. Both steps rely on transformer-based neural network architectures. To train our final classifier, we extended the dataset with all English and French Unified Medical Language System (UMLS) vocabularies related to human diseases. We focus on the multilingualism of training resources and models, with experiments combining French and English in different ways (multilingual embeddings or translation).
We obtained an overall average micro-F1 score of 0.811 for the multilingual version, 0.807 for the French-only version and 0.797 for the translated version.
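For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-averaged F1 score can be computed for multilabel predictions: true positives, false positives and false negatives are pooled over all documents and labels before a single F1 is derived. The function name `micro_f1`, the label names and the toy data are illustrative assumptions. They are not taken from the study, and this is not the evaluation script used in it.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over multilabel predictions.

    Each element of `gold` and `pred` is the set of labels assigned
    to one document. Counts are pooled across all documents.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in gold
        fp += len(p - g)   # labels predicted but absent from gold
        fn += len(g - p)   # gold labels that were missed

    denom = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0


# Toy example with hypothetical label names (not data from the study).
gold = [{"cardiovascular", "infections"}, {"tumors"}, {"nutritional"}]
pred = [{"cardiovascular"}, {"tumors", "infections"}, {"nutritional"}]
print(micro_f1(gold, pred))  # 3 TP, 1 FP, 1 FN -> 0.75
```

Because counts are pooled before averaging, frequent labels weigh more heavily in the score than rare ones.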
Our study proposes an original multilabel classification of French clinical notes for patient phenotyping. We show that a multilingual algorithm trained on annotated real clinical notes and UMLS vocabularies leads to the best results.