Lopez-Garcia Guillermo, Xu Dongfang, Luu Michael, Zheng Renning, Daskivich Timothy J, Gonzalez-Hernandez Graciela
Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA.
Department of Biostatistics, Cedars-Sinai Medical Center, Los Angeles, CA, USA.
medRxiv. 2025 Aug 11:2025.08.07.25333034. doi: 10.1101/2025.08.07.25333034.
Effective risk communication is essential to shared decision-making in prostate cancer care. However, the quality of physician communication of key tradeoffs varies widely in real-world consultations. Manual evaluation of communication is labor-intensive and not scalable. We present a structured, rubric-based framework that uses large language models (LLMs) to automatically score the quality of risk communication in prostate cancer consultations. Using transcripts from 20 clinical visits, we curated and annotated 487 physician-spoken sentences that referenced five decision-making domains: cancer prognosis, life expectancy, and three treatment side effects (erectile dysfunction, incontinence, and irritative urinary symptoms). Each sentence was assigned a score from 0 to 5 based on the precision and patient-specificity of the communicated risk, using a validated scoring rubric. We modeled this task as five multiclass classification problems and evaluated both fine-tuned transformer baselines and GPT-4o with rubric-based and chain-of-thought (CoT) prompting. Our best-performing approach, which combined rubric-based CoT prompting with few-shot learning, achieved micro-averaged F1 scores between 85.0 and 92.0 across domains, outperforming supervised baselines and matching inter-annotator agreement. These findings establish a scalable foundation for AI-driven evaluation of physician-patient communication in oncology and beyond.
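To make the scoring setup concrete, below is a minimal sketch of rubric-based CoT prompting with a few-shot example, followed by a micro-averaged F1 computation over one domain. The rubric text, few-shot example, prompt wording, and sample data are illustrative placeholders, not the authors' actual materials; the sketch assumes the OpenAI Python SDK and scikit-learn are available.

```python
# Sketch: rubric-based chain-of-thought scoring of physician sentences (0-5),
# one domain at a time, evaluated with micro-averaged F1.
# Placeholder rubric, few-shot example, and data -- not the study's prompts.
from openai import OpenAI
from sklearn.metrics import f1_score

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Placeholder rubric summary for one domain (erectile dysfunction risk).
RUBRIC = """Score the physician sentence from 0 to 5:
0 = no risk information; 1 = side effect mentioned without any likelihood;
3 = qualitative likelihood only (e.g., "common", "unlikely");
5 = precise, patient-specific numeric risk. Intermediate scores fall between."""

# One hypothetical few-shot example (sentence, reasoning, score).
FEW_SHOT = """Sentence: "Most men notice some change in erections after surgery."
Reasoning: Names the side effect with only a vague qualitative likelihood and
no patient-specific estimate.
Score: 3"""

def score_sentence(sentence: str) -> int:
    """Ask GPT-4o to reason step by step against the rubric, then emit a score."""
    prompt = (
        f"{RUBRIC}\n\nExample:\n{FEW_SHOT}\n\n"
        f'Sentence: "{sentence}"\n'
        "Think step by step against the rubric, then answer on the final line "
        "as: Score: <0-5>"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Parse the trailing "Score: <n>" line of the model's response.
    last_line = resp.choices[0].message.content.strip().splitlines()[-1]
    return int(last_line.split(":")[-1])

# Micro-averaged F1 against annotator labels for one domain (placeholder data).
gold = [5, 3, 0, 1]
pred = [score_sentence(s) for s in [
    "Your risk of erectile dysfunction is about 30 percent.",
    "Some men have trouble with erections afterwards.",
    "We will schedule a follow-up visit in six weeks.",
    "Erectile dysfunction can occur.",
]]
print(f1_score(gold, pred, average="micro"))
```

Treating each of the five decision-making domains as its own multiclass task, as the abstract describes, would mean running this kind of scorer and evaluation once per domain.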