Koh Sky Wei Chee, Wong Eunice Rui Ning, Tan John Chong Min, van der Lubbe Stephanie C C, Goh Jun Cong, Ching Ethan Sheng Yong, Chia Ian Wen Yih, Low Si Hui, Ang Ping Young, Quek Queenie, Motani Mehul, Valderas Jose M
Division of Family Medicine, Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, NUHS Tower Block Level 9, 1E Kent Ridge Road, Singapore, 119228, Singapore, 65 67163185.
National University Polyclinics, National University Health System, Singapore, Singapore.
J Med Internet Res. 2025 Aug 6;27:e74231. doi: 10.2196/74231.
Patient complaints provide valuable insights into the performance of health care systems, highlighting potential risks that are not apparent to staff. They can drive systemic changes that enhance patient safety. However, manual categorization and analysis pose a substantial logistical challenge, limiting the ability to harness the potential of these data.
This study aims to evaluate the accuracy of artificial intelligence (AI)-powered categorization of patient complaints in primary care based on the Healthcare Complaint Analysis Tool (HCAT) General Practice (GP) taxonomy and assess the importance of advanced large language models (LLMs) in complaint categorization.
This cross-sectional study analyzed 1816 anonymous patient complaints from 7 public primary care clinics in Singapore. Complaints were first coded by trained human coders using the HCAT (GP) taxonomy through a rigorous process involving independent assessment and consensus discussions. Three LLMs (GPT-3.5 turbo, GPT-4o mini, and Claude 3.5 Sonnet) were then used to classify the same complaints against the manual classification, and Claude 3.5 Sonnet was further used to identify complaint themes. LLM classifications were compared with the human coding using accuracy and F1-score; Cohen κ assessed AI-human agreement, and the McNemar test compared concordance between the AI models.
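As an illustration of the statistics named above (not the authors' code), the following minimal Python sketch computes accuracy, macro F1-score, and Cohen κ for one hypothetical LLM against human HCAT (GP) coding, and applies the McNemar test to compare how often two hypothetical LLMs match that coding; the labels, variable names, and the scikit-learn/statsmodels tooling are illustrative assumptions rather than details reported in the study.

from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical example data: human reference coding and two LLMs' domain labels
human = ["management", "clinical", "management", "relationship"]
llm_a = ["management", "clinical", "relationship", "relationship"]
llm_b = ["management", "management", "management", "relationship"]

# Accuracy and macro F1-score of one LLM versus the human coding
print("accuracy:", accuracy_score(human, llm_a))
print("macro F1:", f1_score(human, llm_a, average="macro"))

# Cohen kappa: chance-corrected AI-human agreement
print("kappa:", cohen_kappa_score(human, llm_a))

# McNemar test: do the two LLMs differ in how often they match the human coding?
a_ok = [a == h for a, h in zip(llm_a, human)]
b_ok = [b == h for b, h in zip(llm_b, human)]
table = [
    [sum(x and y for x, y in zip(a_ok, b_ok)), sum(x and not y for x, y in zip(a_ok, b_ok))],
    [sum((not x) and y for x, y in zip(a_ok, b_ok)), sum((not x) and (not y) for x, y in zip(a_ok, b_ok))],
]
print(mcnemar(table, exact=True))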
The majority of complaints fell under the HCAT (GP) domain of management (1079/1816, 59.4%), specifically relating to institutional processes (830/1816, 45.7%). Most complaints were of medium severity (994/1816, 54.7%) and resulted in minimal harm (75.4%), and the most common stage of care was within the practice (627/1816, 34.5%). LLMs achieved moderate to good accuracy (58.4%-95.5%) in HCAT (GP) field classifications, with GPT-4o mini generally outperforming GPT-3.5 turbo, except in severity classification. All 3 LLMs demonstrated moderate concordance with human coding (average 61.9%-68.8%), with varying levels of agreement (κ=0.114-0.623). GPT-4o mini and Claude 3.5 Sonnet significantly outperformed GPT-3.5 turbo in several fields (P<.05), such as domain and stage of care classification. Thematic analysis using Claude 3.5 Sonnet identified long wait times (393/1816, 21.6%), staff attitudes (287/1816, 15.8%), and appointment booking issues (191/1816, 10.5%) as the top concerns, which together accounted for nearly half of all complaints.
Our study highlighted the potential of LLMs for classifying patient complaints in primary care using the HCAT (GP) taxonomy. While GPT-4o mini and Claude 3.5 Sonnet demonstrated promising results, further fine-tuning and model training are required to improve accuracy. Integrating AI into complaint analysis can facilitate proactive identification of systemic issues, ultimately enhancing quality improvement and patient safety. By leveraging LLMs, health care organizations can prioritize complaints and escalate high-risk issues more effectively. Theoretically, this could lead to improved patient care and experience; further research is needed to confirm this potential benefit.