Bui Nhat, Nguyen Giang, Nguyen Nguyen, Vo Bao, Vo Luan, Huynh Tom, Tang Arthur, Tran Van Nhiem, Huynh Tuyen, Nguyen Huy Quang, Dinh Minh
School of Science, Engineering and Technology, RMIT University, Ho Chi Minh City, Vietnam.
Comput Methods Programs Biomed. 2025 May;263:108655. doi: 10.1016/j.cmpb.2025.108655. Epub 2025 Feb 12.
This study presents a methodology for compiling training datasets and fine-tuning Large Language Models (LLMs) for healthcare information in Vietnamese, a low-resource language. The objective is to bridge the gap in medical information accessibility and enhance healthcare communication in developing countries by adapting LLMs to specific linguistic nuances and domain needs.
The methodology involves selecting a base model, compiling a domain-specific dataset, and fine-tuning the model on that dataset. Three open-source models were selected. The dataset, comprising approximately 337,000 prompt-response pairs in Vietnamese, was compiled from existing datasets, data crawled from Vietnamese medical online forums, and data distilled from Vietnamese medical textbooks. The three models were fine-tuned using the Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA) techniques. Model performance was evaluated using BERTScore, ROUGE-L, and the "LLM-as-a-Judge" method.
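The core idea of LoRA is to freeze the pretrained weight matrix W and learn only a low-rank update ΔW = B·A, drastically reducing the number of trainable parameters. The dependency-free sketch below illustrates the forward pass only; it is not the study's implementation. The toy constant weights are assumptions for illustration, and real implementations (e.g., Hugging Face's peft library) initialize A randomly, train A and B by gradient descent, and, for QLoRA, keep W in quantized form.

```python
def matmul(A, B):
    # naive matrix multiply, sufficient for this sketch
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, d_out, d_in, r=2, alpha=4):
        self.W = [[1.0] * d_in for _ in range(d_out)]  # frozen pretrained weight (toy values)
        self.A = [[1.0] * d_in for _ in range(r)]      # trainable down-projection, r x d_in (toy values)
        self.B = [[0.0] * r for _ in range(d_out)]     # trainable up-projection, zero-initialized
        self.scale = alpha / r

    def forward(self, x):
        # x is a column vector given as a list of single-element rows
        base = matmul(self.W, x)                       # frozen path
        update = matmul(self.B, matmul(self.A, x))     # low-rank path
        return [[b[0] + self.scale * u[0]] for b, u in zip(base, update)]
```

Because B starts at zero, a freshly wrapped layer reproduces the base model exactly; fine-tuning then moves only A and B, which is why the adapter weights are small enough to train and store cheaply.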
The fine-tuned models outperformed their base versions on all three evaluation measures (BERTScore, ROUGE-L, and the "LLM-as-a-Judge" method), confirming the effectiveness of the fine-tuning process. This study details the process of fine-tuning open-source LLMs for health information inquiries in Vietnamese, demonstrating its potential to improve healthcare communication in low-resource languages. Deploying the fine-tuned LLM on-premise enhances data privacy and security. However, the significant computing power and costs required pose challenges, especially for organizations in developing countries.
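Of the metrics above, ROUGE-L is the simplest to state precisely: it scores the overlap between a reference answer and a model's answer via the longest common subsequence (LCS) of their tokens. The sketch below uses the balanced F-measure form (β = 1) with whitespace tokenization; the β value and tokenizer actually used in the study are not specified in this abstract, so these are illustrative assumptions.

```python
def lcs_len(a, b):
    # dynamic-programming longest common subsequence length over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate, beta=1.0):
    # F-measure over LCS-based precision and recall
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return ((1 + beta ** 2) * precision * recall) / (recall + beta ** 2 * precision)
```

BERTScore replaces this exact-token matching with contextual-embedding similarity, and "LLM-as-a-Judge" asks a stronger LLM to rate answers; together the three measures cover surface overlap, semantic similarity, and holistic quality.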
This case study highlights the unique challenges faced by developing countries that use low-resource languages. Initiatives are needed to bridge healthcare gaps in underserved areas and to contribute to global health equity.