Cho Seungbeom, Lee Mangyeong, Yu Jaewook, Yoon Junghee, Choi Jae-Boong, Jung Kyu-Hwan, Cho Juhee
School of Mechanical Engineering, Sungkyunkwan University, Suwon-si, Republic of Korea.
Department of Digital Health, Samsung Advanced Institute for Health Sciences and Technology, Sungkyunkwan University, Seoul, Republic of Korea.
J Med Internet Res. 2024 Dec 11;26:e63892. doi: 10.2196/63892.
Hospital call centers play a critical role in providing support and information to patients with cancer, making it crucial to effectively identify and understand patient intent during consultations. However, operational efficiency and standardization of telephone consultations, particularly when categorizing diverse patient inquiries, remain significant challenges. While traditional deep learning models like long short-term memory (LSTM) and bidirectional encoder representations from transformers (BERT) have been used to address these issues, they heavily depend on annotated datasets, which are labor-intensive and time-consuming to generate. Large language models (LLMs) like GPT-4, with their in-context learning capabilities, offer a promising alternative for classifying patient intent without requiring extensive retraining.
This study evaluates the performance of GPT-4 in classifying the purpose of telephone consultations of patients with cancer. In addition, it compares the performance of GPT-4 to that of discriminative models, such as LSTM and BERT, with a particular focus on their ability to manage ambiguous and complex queries.
We used a dataset of 430,355 sentences from telephone consultations with patients with cancer between 2016 and 2020. LSTM and BERT models were trained on 300,000 sentences using supervised learning, while GPT-4 was applied using zero-shot and few-shot approaches without explicit retraining. The accuracy of each model was compared using 1,000 randomly selected sentences from 2020 onward, with special attention paid to how each model handled ambiguous or uncertain queries.
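The few-shot setup and accuracy comparison described above can be illustrated with a minimal sketch. It is not the authors' prompt or pipeline. The category list contains only the five names mentioned in this abstract (the study's full label set is not given here), and the function names, prompt wording, and example sentences are hypothetical. The call to the language model is left out deliberately, since the abstract does not describe it.

```python
from typing import List, Sequence, Tuple

# Assumption: only the categories named in the abstract; the real label set may be larger.
CATEGORIES = ["Treatment", "Rescheduling", "Symptoms", "Records", "Routine"]


def build_few_shot_prompt(
    examples: Sequence[Tuple[str, str]],
    query: str,
    categories: Sequence[str] = CATEGORIES,
) -> str:
    """Assemble a classification prompt from labeled examples and one query.

    Passing an empty `examples` sequence yields a zero-shot prompt.
    The prompt wording is illustrative, not the wording used in the study.
    """
    lines: List[str] = [
        "Classify the purpose of the patient's sentence into one of: "
        + ", ".join(categories)
        + ".",
        "",
    ]
    for sentence, label in examples:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Category: {label}")
        lines.append("")
    # The query is left unlabeled so the model completes the category.
    lines.append(f"Sentence: {query}")
    lines.append("Category:")
    return "\n".join(lines)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical sentences, invented for illustration only.
    shots = [
        ("I need to move my appointment to next week.", "Rescheduling"),
        ("Can I get a copy of my test results?", "Records"),
    ]
    print(build_few_shot_prompt(shots, "I have had a fever since my last infusion."))
```

In this sketch the same `accuracy` function would score every model on the same held-out sentences, which is what makes the comparison between the prompted and the supervised models direct.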
In the few-shot setting, where only a few labeled examples are supplied in the prompt, GPT-4 attained an accuracy of 85.2%, considerably outperforming the LSTM and BERT models, which achieved accuracies of 73.7% and 71.3%, respectively. Notably, categories such as "Treatment," "Rescheduling," and "Symptoms" involve multiple contexts and are highly complex; in these categories, GPT-4 outperformed the discriminative models by more than 15% on ambiguous queries. GPT-4 also outperformed the discriminative models in categories like "Records" and "Routine," where contextual clues were clear. These findings emphasize the potential of LLMs, particularly GPT-4, for interpreting complicated patient interactions during cancer-related telephone consultations.
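As a small worked check on the overall figures reported above, the sketch below converts the three accuracies into gaps in percentage points and into implied counts of correctly classified sentences, assuming all three models were scored on the same 1,000 test sentences. The variable and function names are illustrative. The "more than 15%" figure applies to specific categories and is not derived here.

```python
# Overall accuracies (in percent) as reported in the abstract.
reported_accuracy = {
    "GPT-4 (few-shot)": 85.2,
    "LSTM": 73.7,
    "BERT": 71.3,
}

# Assumption: every model was evaluated on the same 1,000 test sentences.
TEST_SET_SIZE = 1000


def gap_in_points(a: float, b: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return round(a - b, 1)


def implied_correct(accuracy_percent: float, n: int = TEST_SET_SIZE) -> int:
    """Number of correct predictions implied by an accuracy on n items."""
    return round(accuracy_percent / 100 * n)


gpt4 = reported_accuracy["GPT-4 (few-shot)"]
gaps = {
    name: gap_in_points(gpt4, acc)
    for name, acc in reported_accuracy.items()
    if name != "GPT-4 (few-shot)"
}
correct_counts = {name: implied_correct(acc) for name, acc in reported_accuracy.items()}

if __name__ == "__main__":
    for name, gap in gaps.items():
        print(f"GPT-4 (few-shot) vs {name}: +{gap} percentage points")
    for name, count in correct_counts.items():
        print(f"{name}: about {count} of {TEST_SET_SIZE} sentences correct")
```

Under that assumption, the overall gap is 11.5 points over LSTM and 13.9 points over BERT, or roughly 115 and 139 additional sentences classified correctly out of 1,000.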
This study shows the potential of GPT-4 to significantly improve the classification of patient intent in cancer-related telephone consultations. GPT-4's ability to handle complex and ambiguous queries without extensive retraining provides a substantial advantage over discriminative models like LSTM and BERT. Although GPT-4 performs strongly across categories, further refinement of prompt design and category definitions is necessary to fully leverage its capabilities in practical health care applications. Future research will explore the integration of LLMs like GPT-4 into hybrid systems that combine human oversight with artificial intelligence-driven technologies.