Defining a Preprocessing Pipeline for the MULTI-SITA Project and General Medical Italian Natural Language Data.

Authors

Cappello Alice, Mora Sara, Giacobbe Daniele Roberto, Bassetti Matteo, Giacomini Mauro

Affiliations

Clinica Malattie Infettive, IRCCS Ospedale Policlinico San Martino, Genoa, Italy.

Department of Informatics, Bioengineering, Robotics and System Engineering, University of Genoa, Genoa, Italy.

Publication information

Stud Health Technol Inform. 2023 Oct 20;309:48-52. doi: 10.3233/SHTI230737.

Abstract

The application of Natural Language Processing (NLP) to medical data has revolutionized several aspects of health care. Its benefits extend to many areas, including chatbots that can provide medical assistance remotely. Every application of NLP depends on one essential first step: preprocessing the retrieved corpus. The raw data must be prepared so that it can be used efficiently in further analysis. Considerable progress has been made in this direction for English, but for other languages, such as Italian, the state of the art is less advanced, especially for texts containing technical medical terms. The aim of this work is to identify and develop a preprocessing pipeline suitable for medical data written in Italian. The pipeline was developed in a Python environment using the Enchant and NLTK modules and BERT- and BART-based models from Hugging Face. It was then tested on real typed conversations between patients and physicians about medical questions. The algorithm was developed within the MULTI-SITA project of the Italian Society of Anti-Infective Therapy (SITA), but its flexible structure can adapt to a wide variety of data.
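
The abstract does not list the individual pipeline steps, so the following is only a minimal sketch of what a generic preprocessing pass over Italian patient messages might look like, not the authors' implementation. It uses only the Python standard library. The small `STOPWORDS`, `KNOWN_WORDS`, and `MEDICAL_TERMS` sets are hypothetical stand-ins: a real pipeline would more plausibly take stopwords from NLTK and dictionary checks from Enchant, and would hand flagged tokens to a correction or language-model step.

```python
import re
import unicodedata

# Hypothetical stand-ins for illustration only. A real pipeline would likely
# use NLTK's Italian stopword list and an Enchant Italian dictionary.
STOPWORDS = {"ho", "la", "da", "e", "non", "il", "di", "che"}
KNOWN_WORDS = {"febbre", "tre", "giorni", "tosse", "passa", "posso", "prendere"}
# Domain terms that a general-purpose dictionary may not contain.
MEDICAL_TERMS = {"amoxicillina"}

# Runs of letters only (accented letters included; digits and "_" excluded).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def normalize(text):
    """Apply Unicode NFC normalization and lowercase the text."""
    return unicodedata.normalize("NFC", text).lower()


def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation and numbers."""
    return TOKEN_RE.findall(text)


def preprocess(text):
    """Return (kept_tokens, unknown_tokens) for one message.

    kept_tokens: tokens that remain after stopword removal.
    unknown_tokens: kept tokens found in neither the general word list nor
    the medical term list, i.e. candidates for spelling correction.
    """
    tokens = tokenize(normalize(text))
    kept = [t for t in tokens if t not in STOPWORDS]
    unknown = [
        t for t in kept if t not in KNOWN_WORDS and t not in MEDICAL_TERMS
    ]
    return kept, unknown


# "tose" is a deliberate misspelling of "tosse" (cough).
message = "Ho la febbre da tre giorni e la tose non passa. Posso prendere Amoxicillina?"
kept, unknown = preprocess(message)
print(kept)
print(unknown)
```

Keeping a separate medical term list reflects the point made in the abstract that technical medical vocabulary is where general-purpose Italian resources tend to fall short: without it, a drug name such as "amoxicillina" would be flagged as a misspelling alongside the genuine typo.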
