Haghighi Tania, Gholami Sina, Sokol Jared Todd, Kishnani Enaika, Ahsaniyan Adnan, Rahmanian Holakou, Hedayati Fares, Leng Theodore, Alam Minhaj Nur
Department of Electrical Engineering, University of North Carolina at Charlotte, Charlotte, NC, United States.
Department of Computer Science, Baha'i Institute for Higher Education, Tehran, Iran.
bioRxiv. 2024 Apr 29:2024.04.26.591355. doi: 10.1101/2024.04.26.591355.
Training Large Language Models (LLMs) with in-domain data can significantly enhance their performance, leading to more accurate and reliable question-answering (QA) systems essential for supporting clinical decision-making and educating patients.
This study introduces LLMs trained on in-domain, well-curated ophthalmic datasets. We also present a substantial open-source ophthalmic language dataset for model training. Our LLMs (EYE-Llama) were first pre-trained on an ophthalmology-specific dataset comprising paper abstracts, textbooks, EyeWiki, and Wikipedia articles. Subsequently, the models underwent fine-tuning on a diverse range of QA datasets. The LLMs at each stage were then compared to the baseline Llama 2, ChatDoctor, and ChatGPT (GPT-3.5) models on four distinct test sets, and evaluated quantitatively (accuracy, F1 score, and BERTScore) and qualitatively by two ophthalmologists.
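For readers unfamiliar with this recipe, the following is a minimal sketch of the two-stage procedure described above: continued pre-training of a Llama 2 checkpoint on raw domain text, followed by supervised fine-tuning on QA pairs, using the Hugging Face transformers Trainer. The base checkpoint name, dataset file names, prompt template, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 defines no pad token
model = AutoModelForCausalLM.from_pretrained(BASE)
# mlm=False makes the collator copy input_ids into labels (causal LM objective).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

def train_on(dataset, out_dir, epochs):
    Trainer(
        model=model,
        args=TrainingArguments(output_dir=out_dir, num_train_epochs=epochs,
                               per_device_train_batch_size=1),
        data_collator=collator,
        train_dataset=dataset,
    ).train()

# Stage 1: continued pre-training on raw domain text (abstracts, textbooks,
# EyeWiki, Wikipedia); "ophtha_corpus.jsonl" is a placeholder file name.
corpus = load_dataset("json", data_files="ophtha_corpus.jsonl")["train"]
train_on(corpus.map(tokenize, batched=True, remove_columns=corpus.column_names),
         "eye-llama-pretrained", epochs=1)

# Stage 2: supervised fine-tuning on QA pairs serialized into a single prompt
# string; "ophtha_qa.jsonl" and the template are likewise placeholders.
qa = load_dataset("json", data_files="ophtha_qa.jsonl")["train"]
qa = qa.map(lambda ex: {"text": f"Question: {ex['question']}\nAnswer: {ex['answer']}"})
train_on(qa.map(tokenize, batched=True, remove_columns=qa.column_names),
         "eye-llama-finetuned", epochs=3)
```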
When evaluated on the American Academy of Ophthalmology (AAO) test set using BERTScore (F1) as the metric, our models surpassed both Llama 2 and ChatDoctor and matched ChatGPT, a model with 175 billion parameters (EYE-Llama: 0.57, Llama 2: 0.56, ChatDoctor: 0.56, ChatGPT: 0.57). When evaluated on the MedMCQA test set, the fine-tuned models demonstrated higher accuracy than the Llama 2 and ChatDoctor models (EYE-Llama: 0.39, Llama 2: 0.33, ChatDoctor: 0.29); however, ChatGPT outperformed EYE-Llama with an accuracy of 0.55. When tested on the PubMedQA set, the fine-tuned model surpassed the Llama 2, ChatGPT, and ChatDoctor models in accuracy (EYE-Llama: 0.96, Llama 2: 0.90, ChatGPT: 0.93, ChatDoctor: 0.92).
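As a hedged illustration of how such metrics are typically computed, the snippet below scores a free-text answer with the bert-score package and a batch of multiple-choice predictions with scikit-learn; the example answers and answer key are placeholders, not items from the AAO, MedMCQA, or PubMedQA test sets.

```python
from bert_score import score as bert_score
from sklearn.metrics import accuracy_score, f1_score

# Free-text QA (AAO-style): semantic similarity between a model answer and a
# reference answer; bert_score returns per-pair precision/recall/F1 tensors.
candidates = ["Open-angle glaucoma is usually treated first with topical drops."]
references = ["First-line therapy for open-angle glaucoma is topical medication."]
precision, recall, f1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.2f}")

# Multiple-choice QA (MedMCQA/PubMedQA style): compare predicted option
# letters against the answer key.
predicted = ["A", "C", "B", "D"]
gold = ["A", "C", "C", "D"]
print("Accuracy:", accuracy_score(gold, predicted))
print("Macro F1:", f1_score(gold, predicted, average="macro"))
```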
The study shows that pre-training and fine-tuning LLMs such as EYE-Llama enhance their performance in specific medical domains. Our EYE-Llama models surpass baseline Llama 2 in all evaluations, highlighting the effectiveness of specialized LLMs in medical QA systems. (Funded by NEI R15EY035804 (MNA) and a UNC Charlotte Faculty Research Grant (MNA).)