Wei Yishu, Wang Xindi, Ong Hanley, Zhou Yiliang, Flanders Adam, Shih George, Peng Yifan
Department of Population Health Sciences, Weill Cornell Medicine, New York.
Department of Radiology, Weill Cornell Medicine, New York.
AMIA Jt Summits Transl Sci Proc. 2025 Jun 10;2025:614-623. eCollection 2025.
Despite significant progress in applying large language models (LLMs) to the medical domain, several limitations still prevent their practical application. Among these are constraints on model size and the lack of cohort-specific labeled datasets. In this work, we investigated the potential of improving a lightweight LLM, such as Llama 3.1-8B, through fine-tuning with synthetically labeled datasets. Two tasks are jointly trained by combining their respective instruction datasets. When the quality of the task-specific synthetic labels is relatively high (e.g., generated by GPT-4o), Llama 3.1-8B achieves satisfactory performance on the open-ended disease detection task, with a micro F1 score of 0.91. Conversely, when the quality of the task-relevant synthetic labels is relatively low (e.g., from the MIMIC-CXR dataset), fine-tuned Llama 3.1-8B is able to surpass its noisy teacher labels (micro F1 score of 0.67 vs. 0.63) when calibrated against curated labels, indicating the model's strong underlying capability. These findings demonstrate the potential of fine-tuning LLMs with synthetic labels, offering a promising direction for future research on LLM specialization in the medical domain.
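The micro F1 scores reported in the abstract pool true positives, false positives, and false negatives across all samples before computing a single precision/recall pair, which weights frequent disease labels more heavily than macro averaging would. A minimal sketch of this metric for open-ended multi-label disease detection, using hypothetical label sets (not data from the paper):

```python
# Minimal sketch of micro-averaged F1 for multi-label disease detection.
# The label sets below are illustrative examples, not from the paper's data.

def micro_f1(predictions, references):
    """Micro-averaged F1 over multi-label outputs.

    Pools true positives, false positives, and false negatives
    across all samples, then computes a single precision/recall pair.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predictions, references):
        tp += len(pred & gold)   # labels correctly predicted
        fp += len(pred - gold)   # spurious labels
        fn += len(gold - pred)   # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical model outputs vs. reference labels for two reports
preds = [{"edema", "pneumonia"}, {"pleural effusion"}]
golds = [{"edema"}, {"pleural effusion", "atelectasis"}]
print(round(micro_f1(preds, golds), 2))  # → 0.67
```

Because counts are pooled globally, a model that performs well on common findings can score high even if it misses rare ones, which is one reason curated labels matter when calibrating against noisy teacher labels.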