Elvas Luis B, Almeida Ana, Ferreira João C
Department of Logistics, Molde University College, Molde 6410, Norway; Inov Inesc Inovação - Instituto de Novas Tecnologias, 1000-029 Lisbon, Portugal; Breast Cancer Research Program, Champalimaud Foundation, Lisbon, Portugal; ISTAR, Instituto Universitário de Lisboa (ISCTE-IUL), 1649-026 Lisbon, Portugal.
ISTAR, Instituto Universitário de Lisboa (ISCTE-IUL), 1649-026 Lisbon, Portugal.
Int J Med Inform. 2025 Dec;204:106049. doi: 10.1016/j.ijmedinf.2025.106049. Epub 2025 Jul 17.
The exponential growth of digitized medical data has created significant challenges for healthcare professionals, as medical documentation transitions from simple text records to complex, multi-dimensional data structures. Natural Language Processing (NLP), particularly Named Entity Recognition (NER), has emerged as a crucial tool for extracting and categorizing critical information from clinical texts. The development of transformer-based models such as BERT, and the ability to fine-tune pre-trained AI models, have revolutionized the field, offering unprecedented opportunities for efficient and precise interpretation of medical data across diverse languages and healthcare contexts.
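To make the NER task concrete: clinical NER systems typically emit token-level BIO tags (B-X begins an entity of type X, I-X continues it, O is outside any entity), which are then grouped into entity spans. The sketch below illustrates this grouping step in plain Python; in practice the tags would come from a fine-tuned transformer model, and the tokens, tags, and the SYMPTOM label here are illustrative assumptions, not drawn from any reviewed study.

```python
# Sketch: grouping BIO tags into entity spans, the post-processing step
# behind most clinical NER pipelines. Tags are hard-coded for illustration;
# a fine-tuned BERT-style model would normally predict them.

def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)         # continue the open entity
        else:                             # O tag or inconsistent I- tag
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:                           # flush a trailing entity
        entities.append((" ".join(current), label))
    return entities

tokens = ["Patient", "denies", "chest", "pain", "since", "admission"]
tags   = ["O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "O"]
print(extract_entities(tokens, tags))  # [('chest pain', 'SYMPTOM')]
```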
This literature review aimed to analyze recent NLP approaches for medical text processing, examining techniques, performance metrics, and advancements across different languages and healthcare contexts.
Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology, a scoping search was conducted in the Scopus and PubMed databases, focusing on studies published between 2019 and 2024. The review included studies on language model fine-tuning and information extraction in healthcare, with a specific search query designed to capture relevant NLP techniques.
Of 67 initial records, 31 studies were ultimately included. Bidirectional Encoder Representations from Transformers (BERT)-based approaches, neural networks, and Conditional Random Field (CRF)/Long Short-Term Memory (LSTM) techniques dominated, consistently achieving F1-scores above 85%. The studies covered multiple languages, with 51.5% in English, 27.3% in Chinese, and smaller representations in Italian, German, and Spanish. Hybrid approaches and techniques addressing data privacy and limited labeled data were notably prevalent.
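The F1-score reported above is the harmonic mean of precision and recall over predicted entities. A minimal sketch of how it is computed, with illustrative counts that are assumptions for the example rather than figures from any reviewed study:

```python
# Sketch: entity-level F1 from true-positive (tp), false-positive (fp),
# and false-negative (fn) counts. The counts below are illustrative only.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of predicted entities that are correct
    recall = tp / (tp + fn)      # fraction of gold entities that were found
    return 2 * precision * recall / (precision + recall)

# e.g. 90 correctly extracted entities, 10 spurious, 8 missed:
print(round(f1_score(tp=90, fp=10, fn=8), 3))  # 0.909
```

A score of 0.909 would sit above the 85% threshold the included studies consistently reached.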
The review revealed that modern NLP techniques, particularly BERT-based models and hybrid approaches, show significant promise in medical text processing across different languages. While challenges remain in cross-lingual adaptation and data availability, these technologies demonstrate potential to enhance medical data interpretation and analysis.