Siden Rachel, Kerman Hannah, Gallo Robert J, Cool Joséphine A, Hom Jason, Goh Ethan, Ahuja Neera, Heidenreich Paul, Shieh Lisa, Yang Daniel, Chen Jonathan H, Rodman Adam, Holdsworth Laura M
Department of Medicine, Stanford University School of Medicine, Palo Alto, CA, USA.
Division of Hospital Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.
medRxiv. 2025 Jul 23:2025.07.23.25332002. doi: 10.1101/2025.07.23.25332002.
Large language model (LLM) chatbots demonstrate high degrees of accuracy, yet recent studies found that physicians using these same chatbots may score no better, and sometimes worse, on clinical reasoning tests than the chatbot performing alone with researcher-curated prompts. How physicians approach inputting information into chatbots remains unknown.
This study aimed to identify how physicians interact with LLM chatbots on clinical reasoning tasks in order to create a typology of input approaches, and to explore whether input approach type was associated with improved clinical reasoning performance.
We carried out a mixed methods study in three steps. First, we conducted semi-structured interviews with U.S. physicians about their experiences using an LLM chatbot and analyzed the transcripts using the Framework Method to develop a typology based on input patterns. Next, we analyzed the chat logs of physicians who used a chatbot while solving clinical cases, assigning each case an input approach type. Lastly, we used a linear mixed-effects model to compare each input approach type with performance on the clinical cases.
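As a rough illustration of the final step, a linear mixed-effects model can relate case scores to input approach type while accounting for repeated cases solved by the same physician. The sketch below is a minimal, hypothetical example in Python using statsmodels; the column names, coding, and toy data are assumptions for illustration, not the study's actual dataset or model specification.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per physician-case pair.
# "score" is the graded case performance, "approach" is the assigned
# input approach type, and "physician_id" is the grouping factor.
df = pd.DataFrame({
    "score": [72, 65, 80, 58, 90, 77, 63, 85, 70, 88, 61, 79],
    "approach": ["copy_paster", "searcher", "summarizer",
                 "searcher", "selective_copy_paster", "copy_paster",
                 "summarizer", "searcher", "copy_paster",
                 "selective_copy_paster", "searcher", "summarizer"],
    "physician_id": ["p1", "p1", "p1", "p2", "p2", "p2",
                     "p3", "p3", "p3", "p4", "p4", "p4"],
})

# Fixed effect for input approach type; random intercept per physician
# to account for each physician contributing multiple cases. Toy data
# this small may emit a convergence warning but will still fit.
model = smf.mixedlm("score ~ C(approach)", data=df, groups=df["physician_id"])
result = model.fit()
print(result.summary())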
We identified four input approach types based on patterns of "content amount": copy-paster (entire case), selective copy-paster (pieces of a case), summarizer (user-generated case summary), and searcher (short queries). Copy-pasting and searching were used most often. No single type was associated with higher scores on the clinical cases.
This study adds to our understanding of how physicians approach chatbots and identifies the ways in which they intuitively interact with these tools.
Purposeful training and support are needed to help physicians use emerging AI technologies effectively and realize their potential to support safe and effective medical decision-making in practice.