ISTAR, Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal.
Select Data, Anaheim, CA, United States.
JMIR Med Inform. 2024 Oct 21;12:e60164. doi: 10.2196/60164.
In response to the intricate language, specialized terminology outside everyday life, and the frequent presence of abbreviations and acronyms inherent in health care text data, domain adaptation techniques have emerged as crucial to transformer-based models. This refinement in the knowledge of the language models (LMs) allows for a better understanding of medical textual data, which improves performance on medical downstream tasks such as information extraction (IE). We have identified a gap in the literature regarding health care LMs. Therefore, this study presents a scoping literature review investigating domain adaptation methods for transformers in health care, differentiating between English and non-English languages, with a focus on Portuguese. More specifically, we investigated the development of health care LMs, with the aim of comparing Portuguese with other, more developed languages to guide the path of a non-English language with fewer resources.
This study aimed to research health care IE models, regardless of language, to understand the efficacy of transformers and which medical entities are most commonly extracted.
This scoping review was conducted using the PRISMA-ScR (Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews) methodology on the Scopus and Web of Science Core Collection databases. Only studies that mentioned the creation of health care LMs or health care IE models were included, while studies on large language models (LLMs) were excluded. The latter were left out because the review targets LMs rather than LLMs; the two are architecturally different and serve distinct purposes.
Our search query retrieved 137 studies, 60 of which met the inclusion criteria; none of them was a systematic literature review. English and Chinese are the languages with the most health care LMs developed. These languages already have disease-specific LMs, while others have only general health care LMs. European Portuguese does not have any public health care LM and should follow the example of other languages by first developing general health care LMs and then, at a more advanced stage, disease-specific LMs. Regarding IE models, transformers were the most commonly used method, and named entity recognition was the most popular topic, with only a few studies mentioning assertion status or addressing medical lexical problems. The most frequently extracted entities were diagnosis, posology, and symptoms.
The findings indicate that domain adaptation is beneficial, achieving better results in downstream tasks. Our analysis showed that the use of transformers is more developed for the English and Chinese languages. European Portuguese lacks relevant studies and should draw on examples from other non-English languages to develop these models and drive progress in AI. Health care professionals could benefit from models that highlight medically relevant information and streamline the reading of textual data; the extracted information could also be used to create patient medical timelines, allowing for profiling.