Ruiz Sarrias Oskitz, Martínez Del Prado María Purificación, Sala Gonzalez María Ángeles, Azcuna Sagarduy Josune, Casado Cuesta Pablo, Figaredo Berjano Covadonga, Galve-Calvo Elena, López de San Vicente Hernández Borja, López-Santillán María, Nuño Escolástico Maitane, Sánchez Togneri Laura, Sande Sardina Laura, Pérez Hoyos María Teresa, Abad Villar María Teresa, Zabalza Zudaire Maialen, Sayar Beristain Onintza
Department of Mathematics and Statistics, NNBi 2020 SL, 31110 Noain, Navarra, Spain.
Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain.
Cancers (Basel). 2024 Aug 12;16(16):2830. doi: 10.3390/cancers16162830.
Large Language Models (LLMs), such as the GPT model family from OpenAI, have demonstrated transformative potential across various fields, especially in medicine. These models can understand and generate contextual text, adapting to new tasks without specific training. This versatility could transform clinical practice by enhancing documentation, patient interaction, and decision-making processes. In oncology, LLMs offer the potential to significantly improve patient care through the continuous monitoring of chemotherapy-induced toxicities, a task that is often unmanageable with human resources alone. However, existing research has not sufficiently explored the accuracy of LLMs in identifying and assessing subjective toxicities based on patient descriptions. This study aims to fill this gap by evaluating the ability of LLMs to accurately classify these toxicities, facilitating personalized and continuous patient care.
This comparative pilot study assessed the ability of an LLM to classify subjective toxicities from chemotherapy. Thirteen oncologists evaluated 30 fictitious cases created using expert knowledge and OpenAI's GPT-4. These evaluations, based on the CTCAE v.5 criteria, were compared with those of a contextualized LLM. The mode and mean of the oncologists' responses were used to gauge consensus. The accuracy of the LLM was analyzed in both general and specific toxicity categories, considering the types of errors and the rate of false alarms. The study's results are intended to justify further research involving real patients.
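The paper does not publish its prompting pipeline, so the following Python sketch only illustrates how a contextualized GPT-4 model might be asked to grade a patient-reported symptom against CTCAE v5.0. The model string, system prompt, and the classify_toxicity helper are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: the study does not publish its prompting pipeline.
# The model name, system prompt, and helper function are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an oncology assistant. Classify the patient's description of a "
    "chemotherapy-related symptom according to CTCAE v5.0: return the toxicity "
    "term (e.g., 'Nausea', 'Fatigue') and its grade (1-5) as JSON."
)

def classify_toxicity(patient_description: str) -> str:
    """Ask the model for a CTCAE v5.0 term and grade for one patient report."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output aids reproducibility
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": patient_description},
        ],
    )
    return response.choices[0].message.content

print(classify_toxicity(
    "Since the last cycle I vomit two or three times a day and can barely eat."
))
```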
The study revealed significant variability in the oncologists' evaluations, attributable to the lack of interaction with the fictitious patients. Using the mean evaluations as reference, the LLM achieved an accuracy of 85.7% in general categories and 64.6% in specific categories; of its errors, 96.4% were mild and 3.6% were severe. False alarms occurred in 3% of cases. Among the expert oncologists, individual accuracy ranged from 66.7% to 89.2% for general categories and from 57.0% to 76.0% for specific categories, and the 95% confidence intervals for the median oncologist accuracy were 81.9% to 86.9% for general categories and 67.6% to 75.6% for specific categories. These benchmarks highlight the LLM's potential to achieve expert-level performance in classifying chemotherapy-induced toxicities.
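As an illustration of how benchmark statistics of this kind can be computed, the sketch below derives a bootstrap 95% confidence interval for the median accuracy of a panel of thirteen raters. The accuracy values are placeholders, not the study's data, and the bootstrap procedure is an assumption, since the paper does not state its exact interval method.

```python
# Illustrative only: placeholder accuracies, not the study's data, and the
# bootstrap CI is an assumed method (the paper does not state its CI procedure).
import numpy as np

rng = np.random.default_rng(42)

# Placeholder per-oncologist accuracies for one toxicity-category level.
oncologist_acc = np.array([0.67, 0.72, 0.75, 0.78, 0.80, 0.82, 0.83,
                           0.84, 0.85, 0.86, 0.87, 0.88, 0.89])

# Resample the 13 accuracies with replacement and take the median each time.
boot_medians = np.array([
    np.median(rng.choice(oncologist_acc, size=oncologist_acc.size, replace=True))
    for _ in range(10_000)
])

ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"median accuracy = {np.median(oncologist_acc):.3f}")
print(f"95% bootstrap CI for the median = ({ci_low:.3f}, {ci_high:.3f})")
```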
The findings demonstrate that LLMs can classify subjective toxicities from chemotherapy with accuracy comparable to expert oncologists. The LLM achieved 85.7% accuracy in general categories and 64.6% in specific categories. While the model's general category performance falls within expert ranges, specific category accuracy requires improvement. The study's limitations include the use of fictitious cases, lack of patient interaction, and reliance on audio transcriptions. Nevertheless, LLMs show significant potential for enhancing patient monitoring and reducing oncologists' workload. Future research should focus on the specific training of LLMs for medical tasks, conducting studies with real patients, implementing interactive evaluations, expanding sample sizes, and ensuring robustness and generalization in diverse clinical settings.
This study concludes that LLMs can classify subjective toxicities from chemotherapy with accuracy comparable to expert oncologists. The LLM's performance in general toxicity categories is within the expert range, but there is room for improvement in specific categories. LLMs have the potential to enhance patient monitoring, enable early interventions, and reduce severe complications, improving care quality and efficiency. Future research should involve specific training of LLMs, validation with real patients, and the incorporation of interactive capabilities for real-time patient interactions. Ethical considerations, including data accuracy, transparency, and privacy, are crucial for the safe integration of LLMs into clinical practice.