Li Ronghao, Mao Shuai, Zhu Congmin, Yang Yingliang, Tan Chunting, Li Li, Mu Xiangdong, Liu Honglei, Yang Yuqing
School of Biomedical Engineering, Capital Medical University, No. 10 Xitoutiao, You An Men, Fengtai District, Beijing 100069, China. Phone: 86 010-83911542.
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China.
J Med Internet Res. 2025 Jun 11;27:e72638. doi: 10.2196/72638.
The rapid advancements in natural language processing, particularly the development of large language models (LLMs), have opened new avenues for managing complex clinical text data. However, the inherent complexity and specificity of medical texts present significant challenges for the practical application of prompt engineering in diagnostic tasks.
This paper explores LLMs combined with a new prompt engineering technique to enhance model interpretability and improve pulmonary disease prediction performance relative to a traditional deep learning model.
A retrospective dataset of 2965 chest computed tomography (CT) radiology reports was constructed. The reports came from 4 cohorts: healthy individuals and patients with pulmonary tuberculosis, lung cancer, or pneumonia. A novel prompt engineering strategy integrating feature summarization (F-Sum), chain-of-thought (CoT) reasoning, and a hybrid retrieval-augmented generation (RAG) framework was then proposed. The feature summarization approach, leveraging term frequency-inverse document frequency (TF-IDF) and K-means clustering, was used to extract and distill key radiological findings related to the 3 diseases. Simultaneously, the hybrid RAG framework combined dense and sparse vector representations to enhance the LLMs' comprehension of disease-related text. In total, 3 state-of-the-art LLMs, GLM-4-Plus, GLM-4-air (Zhipu AI), and GPT-4o (OpenAI), were combined with the prompt strategy to evaluate their performance in recognizing pneumonia, tuberculosis, and lung cancer. A traditional deep learning model, BERT (Bidirectional Encoder Representations from Transformers), was also included for comparison to assess whether the LLMs offered an advantage. Finally, the proposed method was tested on an external validation dataset consisting of 343 chest CT reports from another hospital.
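Neither the F-Sum step nor the hybrid retrieval fusion is given in code in this abstract; the sketch below illustrates both under stated assumptions: scikit-learn is available, the report text is directly tokenizable (Chinese reports would first need a word segmenter such as jieba supplied as the vectorizer's tokenizer), and all names (summarize_features, hybrid_scores, alpha, n_clusters, top_k) are illustrative rather than taken from the paper. The score fusion shown is a generic weighted sum of dense and sparse similarities, one common way to combine the two representations, not necessarily the exact scheme used in the study.

```python
# Illustrative sketch (not the authors' implementation) of:
#  (1) F-Sum: TF-IDF + K-means to distill key radiological findings per cohort
#  (2) a simple dense/sparse score fusion, standing in for the hybrid RAG retrieval
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


def summarize_features(reports, n_clusters=5, top_k=10):
    """Cluster TF-IDF vectors of one cohort's reports and return the
    highest-weighted terms of each cluster centroid as candidate key findings."""
    vectorizer = TfidfVectorizer(max_features=5000)
    X = vectorizer.fit_transform(reports)                  # (n_reports, n_terms), sparse
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    terms = np.array(vectorizer.get_feature_names_out())
    summaries = []
    for centroid in km.cluster_centers_:                   # one centroid per cluster
        top_terms = terms[np.argsort(centroid)[::-1][:top_k]]
        summaries.append(", ".join(top_terms))
    return summaries                                       # distilled findings for the prompt


def hybrid_scores(dense_sim, sparse_sim, alpha=0.5):
    """Fuse dense (embedding) and sparse (lexical, e.g., TF-IDF/BM25) similarity
    scores with a weighted sum; alpha controls the weight of the dense component."""
    return alpha * np.asarray(dense_sim) + (1 - alpha) * np.asarray(sparse_sim)


# Example usage (hypothetical variables): distill findings for the tuberculosis
# cohort, then rank reference passages by a 50/50 blend of precomputed scores.
# tb_key_findings = summarize_features(tb_reports)
# ranking = np.argsort(hybrid_scores(dense_sim, sparse_sim))[::-1]
```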
Compared with the BERT-based prediction model and various other prompt engineering techniques, our method with GLM-4-Plus achieved the best performance on the test dataset, attaining an F1-score of 0.89 and an accuracy of 0.89. On the external validation dataset, the proposed method with GPT-4o achieved the highest F1-score (0.86) and accuracy (0.92). Compared with the popular strategy of manually selected typical samples (few-shot) and a CoT designed by doctors (F1-score=0.83 and accuracy=0.83), the proposed method, which summarized disease characteristics (F-Sum) with an LLM and automatically generated the CoT, performed better (F1-score=0.89 and accuracy=0.90). Although the BERT-based model achieved similar results on the test dataset (F1-score=0.85 and accuracy=0.88), its predictive performance decreased markedly on the external validation set (F1-score=0.48 and accuracy=0.78).
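For clarity on how such figures are derived, the snippet below computes accuracy and F1-score on toy labels; macro-averaging of F1 across the 4 cohorts is an assumption here, as the abstract does not state the averaging scheme used.

```python
# Toy illustration of the evaluation metrics; y_true/y_pred are made-up labels.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["pneumonia", "tuberculosis", "healthy", "lung_cancer", "pneumonia"]
y_pred = ["pneumonia", "tuberculosis", "healthy", "pneumonia", "pneumonia"]

accuracy = accuracy_score(y_true, y_pred)             # fraction of correct predictions
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```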
These findings highlight the potential of LLMs to revolutionize pulmonary disease prediction, particularly in resource-constrained settings, by surpassing traditional models in both accuracy and flexibility. The proposed prompt engineering strategy not only improves predictive performance but also enhances the adaptability of LLMs in complex medical contexts, offering a promising tool for advancing disease diagnosis and clinical decision-making.