HiTZ Basque Center for Language Technology, Faculty of Engineering Bilbao, University of the Basque Country (UPV/EHU), Spain(1).
IOMED Medical Solutions SL, Barcelona, Spain(2).
J Biomed Inform. 2023 Sep;145:104461. doi: 10.1016/j.jbi.2023.104461. Epub 2023 Aug 2.
Electronic Clinical Narratives (ECNs) store valuable information about individuals' health. However, little open-source data is available. In addition, ECNs can be structurally heterogeneous, ranging from documents with explicit section headings or titles to unstructured notes. This lack of structure complicates both building automatic systems and evaluating them.
The aim of the present work is to provide the scientific community with a Spanish open-source dataset for building and evaluating automatic section identification systems. A further aim is to design and implement a suitable evaluation measure and a fine-tuned language model adapted to the task.
A corpus of unstructured clinical records, in this case progress notes written in Spanish, was annotated with seven major section types. Existing metrics for this task were thoroughly assessed and, building on the most suitable one, we defined a new metric, B2, better tailored to the task.
The annotated corpus, the newly designed evaluation script, and a baseline model are freely available to the community. This model reaches an average B2 score of 71.3 on our open-source dataset and an average B2 score of 67.0 in data-scarcity scenarios where the target corpus and its structure differ from the dataset used to train the language model.
Although section identification in unstructured clinical narratives is challenging, this work shows that it is possible to build competitive automatic systems when both data and the right evaluation metrics are available. The annotated data, the implemented evaluation scripts, and the section identification language model are open-sourced in the hope that this contribution will foster the development of more and better systems.