Goyal Aman, Sulaiman Samia Aziz, Alaarag Abdallah, Hoshan Waseem, Goyal Priya, Shah Viraj, Daoud Mohamed, Mahalwar Gauranga, Sheikh Abu Baker
Department of Internal Medicine, Cleveland Clinic Foundation, Cleveland, OH 44195, United States.
School of Medicine, The University of Jordan, Amman 11942, Jordan.
World J Cardiol. 2025 Aug 26;17(8):110489. doi: 10.4330/wjc.v17.i8.110489.
The integration of sophisticated large language models (LLMs) into healthcare has recently garnered significant attention due to their ability to leverage deep learning techniques to process vast datasets and generate contextually accurate, human-like responses. These models have previously been applied in medical diagnostics, such as in the evaluation of oral lesions. Given the high rate of missed diagnoses in pericarditis, LLMs may support clinicians in generating differential diagnoses, particularly in atypical cases where risk stratification and early identification are critical to preventing serious complications such as constrictive pericarditis and pericardial tamponade.
To compare the accuracy of LLMs in assisting the diagnosis of pericarditis when used as risk stratification tools.
A PubMed search was conducted using the keyword "pericarditis", applying the "case reports" filter, and data from relevant cases were extracted. Inclusion criteria consisted of English-language reports involving patients aged 18 years or older with a confirmed diagnosis of acute pericarditis. The diagnostic capabilities of ChatGPT o1 and DeepThink-R1 were assessed on two endpoints: whether pericarditis appeared among the model's top three differential diagnoses, and whether it was given as the sole provisional diagnosis. Each case was scored "yes" or "no" on each endpoint.
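The two scoring endpoints described above can be sketched as a small function. This is a minimal illustration, not the authors' actual protocol; the argument names and the substring match on "pericarditis" are assumptions.

```python
def score_case(differentials, provisional):
    """Score one model response on the two assumed endpoints:
    - pericarditis appears among the top three differential diagnoses
    - pericarditis is the sole provisional diagnosis
    Both inputs are free-text diagnosis strings (hypothetical format)."""
    target = "pericarditis"
    return {
        "in_top3_differential": any(target in d.lower() for d in differentials[:3]),
        "sole_provisional": target in provisional.lower(),
    }

# Example: a response that lists pericarditis first on both endpoints.
result = score_case(
    ["Acute pericarditis", "Myocardial infarction", "Pulmonary embolism"],
    "Acute pericarditis",
)
print(result)  # both endpoints score True for this response
```

In practice each "yes"/"no" judgment in the study was made by human reviewers reading the model output, so a string match like this is only a schematic stand-in.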
From the initial search, 220 studies were identified, of which 16 case reports met the inclusion criteria. In assessing risk stratification for acute pericarditis, ChatGPT o1 included the condition among its top three differential diagnoses in 10 of 16 cases (62.5%) and identified it as the provisional diagnosis in 8 of 16 cases (50.0%). DeepThink-R1 did so in 8 of 16 cases (50.0%) and 6 of 16 cases (37.5%), respectively. ChatGPT o1 thus demonstrated higher accuracy than DeepThink-R1 in identifying pericarditis.
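The reported percentages follow directly from the per-model hit counts over the 16 included cases. A short sketch of that arithmetic (counts taken from the results above; the dictionary layout is an assumption):

```python
TOTAL_CASES = 16

# Hit counts per model and endpoint, as reported in the results.
hits = {
    "ChatGPT o1": {"differential": 10, "provisional": 8},
    "DeepThink-R1": {"differential": 8, "provisional": 6},
}

def accuracy_pct(n_hits, total=TOTAL_CASES):
    """Proportion of cases (as a percentage) in which pericarditis was identified."""
    return round(100 * n_hits / total, 1)

for model, counts in hits.items():
    print(
        f"{model}: differential {accuracy_pct(counts['differential'])}%, "
        f"provisional {accuracy_pct(counts['provisional'])}%"
    )
```

Running this reproduces the four proportions quoted above (62.5%, 50.0%, 50.0%, 37.5%).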
Further research with larger sample sizes and optimized prompt engineering is warranted to improve diagnostic accuracy, particularly in atypical presentations.