Department of Computer Science, University of Jaén, Campus Las Lagunillas, s/n, 23071, Jaén, Spain.
Natural Language Processing Unit, HT medica, Carmelo Torres, n°2, 23007, Jaén, Spain.
Comput Biol Med. 2023 Mar;154:106581. doi: 10.1016/j.compbiomed.2023.106581. Epub 2023 Jan 23.
This paper presents a new corpus of radiology reports written in Spanish and labeled with ICD-10 codes. CARES (Corpus of Anonymised Radiological Evidences in Spanish) is a high-quality corpus, manually labeled and reviewed by radiologists in their daily practice, that is freely available to the research community on HuggingFace. Resources of this kind are essential for developing automatic text classification tools, since they are needed to train and tune computational systems. In the medical domain, however, they are very difficult to obtain, for reasons that include privacy and data protection issues and the need to involve medical specialists in creating them. After describing the corpus and explaining how it was generated, we carry out a first set of experiments using several machine learning algorithms based on transformer language models, such as BioBERT and RoBERTa, to test the validity of this linguistic resource. The best-performing classifier achieved a micro-averaged F1-score of 0.8676 and a macro-averaged F1-score of 0.8328, and these results encourage us to continue working in this line of research.
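The abstract reports both micro- and macro-averaged F1-scores, which can differ noticeably when some labels are rare. The following is a minimal sketch of how the two averages are computed for a multi-label task; the function names, the ICD-10-style codes, and the toy predictions are illustrative assumptions and are not taken from CARES or from the authors' evaluation code.

```python
from typing import Dict, List, Set, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    y_true: List[Set[str]], y_pred: List[Set[str]]
) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for multi-label predictions."""
    labels = sorted(set().union(*y_true, *y_pred))
    # Per-label counts of true positives, false positives, false negatives.
    counts: Dict[str, List[int]] = {lab: [0, 0, 0] for lab in labels}
    for gold, pred in zip(y_true, y_pred):
        for lab in labels:
            if lab in gold and lab in pred:
                counts[lab][0] += 1  # true positive
            elif lab in pred:
                counts[lab][1] += 1  # false positive
            elif lab in gold:
                counts[lab][2] += 1  # false negative

    # Micro: pool the counts over all labels, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp, fp, fn)

    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(labels)
    return micro, macro


# Toy example with hypothetical codes (not real CARES annotations).
y_true = [{"J18"}, {"J18", "R91"}, {"M54"}, {"J18"}]
y_pred = [{"J18"}, {"J18"}, {"M54", "R91"}, {"J18"}]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
```

In this toy case the rare code is always misclassified, so the macro average falls below the micro average: every label counts equally in the macro score, whereas the micro score is dominated by frequent labels.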