Kaiyrbekov Kurmanbek, Dobbins Nicholas J, Mooney Sean D
Cyberinfrastructure and Artificial Intelligence Platforms Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA.
Biomedical Informatics & Data Science, Department of Medicine, Johns Hopkins University, Baltimore, Maryland, USA.
ArXiv. 2025 Apr 2:arXiv:2504.02891v1.
Traditional phone-based surveys are among the most accessible and widely used methods for collecting biomedical and healthcare data; however, they are often costly, labor-intensive, and difficult to scale effectively. To overcome these limitations, we propose an end-to-end survey collection framework driven by conversational Large Language Models (LLMs).
Our framework consists of a researcher responsible for designing the survey and recruiting participants, a conversational phone agent powered by an LLM that calls participants and administers the survey, a second LLM (GPT-4o) that analyzes the conversation transcripts generated during the surveys, and a database for storing and organizing the results. To test our framework, we recruited 8 participants (5 native and 3 non-native English speakers) and administered 40 surveys. We evaluated the correctness of the LLM-generated conversation transcripts, the accuracy of survey responses inferred by GPT-4o, and overall participant experience.
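As an illustration of the transcript-analysis step, the sketch below shows how a second LLM such as GPT-4o could be prompted to pull structured survey answers out of a call transcript. The prompt wording, the `extract_responses` helper, and the JSON answer format are assumptions made for illustration; they are not the authors' implementation.

```python
# Hypothetical sketch of the transcript-analysis step: a second LLM (GPT-4o)
# reads a call transcript and returns the survey answers as JSON.
# Prompt, schema, and helper name are illustrative assumptions, not the
# paper's actual code.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_responses(transcript: str, questions: list[str]) -> dict:
    """Ask GPT-4o to map each survey question to the participant's answer."""
    prompt = (
        "You are given a phone-survey transcript between an agent and a participant.\n"
        "For each survey question, return the participant's answer.\n"
        "Respond with a JSON object mapping question text to answer.\n\n"
        "Questions:\n" + "\n".join(f"- {q}" for q in questions) +
        f"\n\nTranscript:\n{transcript}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request machine-readable output
    )
    return json.loads(completion.choices[0].message.content)

# Example usage (illustrative only):
# answers = extract_responses(transcript_text,
#                             ["How many hours did you sleep last night?"])
```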
Survey responses were successfully extracted by GPT-4o from the conversation transcripts with an average accuracy of 98%, despite the transcripts exhibiting an average per-line word error rate of 7.7%. While participants noted occasional errors made by the conversational LLM agent, they reported that the agent effectively conveyed the purpose of the survey, demonstrated good comprehension, and maintained an engaging interaction.
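For reference, per-line word error rate is conventionally computed as the word-level edit distance between a reference line and the transcribed line, divided by the number of reference words. The minimal sketch below shows one standard way to compute it; it is illustrative only and is not the paper's evaluation code.

```python
# Minimal word error rate (WER) sketch: word-level Levenshtein distance
# divided by the number of reference words. Illustrative only; not the
# authors' evaluation script.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Average per-line WER over paired reference and transcribed lines:
# avg_wer = sum(wer(r, h) for r, h in zip(ref_lines, hyp_lines)) / len(ref_lines)
```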
Our study highlights the potential of LLM agents in conducting and analyzing phone surveys for healthcare applications. By reducing the workload on human interviewers and offering a scalable solution, this approach paves the way for real-world, end-to-end AI-powered phone survey collection systems.