Yang Jian, Shu Liqi, Duan Huilong, Li Haomin
IEEE J Biomed Health Inform. 2024 Sep 19;PP. doi: 10.1109/JBHI.2024.3464555.
Large language models (LLMs) hold significant promise in clinical practice, yet their real-world adoption is constrained by their propensity to produce erroneous and occasionally harmful outputs, particularly in the intricate domain of rare diseases (RDs). This study introduces RDguru, a conversational intelligent agent built on the LangChain framework and powered by GPT-3.5-turbo. RDguru offers a comprehensive suite of functionalities, encompassing evidence-traceable knowledge Q&A and professional medical consultations for differential diagnosis (DDX), integrating authoritative knowledge sources and reliable tools. A novel multi-source fusion diagnostic model, based on a deep Q-network, combines three diagnostic recommendation strategies (GPT-4, PheLR, and phenotype matching) to enhance diagnostic recall during medical consultations. Through tailored tools and advanced algorithms for retrieval-augmented generation, RDguru excels in knowledge Q&A, automated phenotype annotation, and RD DDX. A multi-aspect Q&A analysis demonstrates that RDguru outperforms ChatGPT in generating descriptions aligned with authoritative knowledge, as quantified by ROUGE scores, GPT-4-based automatic rating, and RAGAs evaluation metrics. Testing on 238 published RD cases reveals that RDguru's top 5 multi-source fusion diagnoses recapture 63.87% of actual diagnoses, marking a 5.47% improvement over the state-of-the-art diagnostic method PheLR. Furthermore, RDguru's consultation strategy proves effective in eliciting diagnostically beneficial phenotypes and refining the prioritization of genuine diagnoses through multi-round phenotype-oriented questioning. Evaluations against established benchmarks and real-world patient data demonstrate RDguru's efficacy and reliability, highlighting its potential to enhance clinical decision-making in the realm of RDs.
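The top 5 figure reported above is a top-k recall: the share of cases whose actual diagnosis appears among the first k ranked candidates. The following is a minimal sketch of how such a metric is commonly computed, not the authors' evaluation code. The function name `top_k_recall`, the toy candidate lists, and the toy ground-truth labels are all hypothetical and chosen only for illustration. As a point of arithmetic, 63.87% of 238 cases is consistent with 152 cases (152/238 ≈ 63.87%).

```python
from typing import Sequence


def top_k_recall(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int = 5) -> float:
    """Fraction of cases whose true diagnosis is among the first k candidates.

    ranked[i] is the ordered candidate list for case i (best first);
    truth[i] is the confirmed diagnosis for case i.
    """
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("at least one case is required")
    hits = sum(1 for candidates, label in zip(ranked, truth) if label in candidates[:k])
    return hits / len(truth)


# Hypothetical toy data (not taken from the study): three cases with ranked candidates.
toy_ranked = [
    ["Marfan syndrome", "Loeys-Dietz syndrome", "Ehlers-Danlos syndrome"],
    ["Fabry disease", "Gaucher disease"],
    ["Wilson disease", "Hemochromatosis"],
]
toy_truth = ["Loeys-Dietz syndrome", "Pompe disease", "Wilson disease"]

print(f"top-5 recall: {top_k_recall(toy_ranked, toy_truth, k=5):.4f}")
print(f"top-1 recall: {top_k_recall(toy_ranked, toy_truth, k=1):.4f}")

# Arithmetic check on the reported figure: 152 hits out of 238 cases.
print(f"152/238 = {152 / 238:.2%}")
```

Exact string matching on disease names is a simplification; a real evaluation would more likely compare standardized disease identifiers.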