Coen Emma, Del Fiol Guilherme, Kaphingst Kimberly A, Borsato Emerson, Shannon Jackilen, Smith Hadley, Masino Aaron, Allen Caitlin G
School of Computing, Clemson University, 105 Sikes Hall, Clemson, SC, 29634, United States.
Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States.
JMIR Cancer. 2025 Jun 10;11:e65848. doi: 10.2196/65848.
The increasing demand for population-wide genomic screening and the limited availability of genetic counseling resources have created a pressing need for innovative service delivery models. Chatbots powered by large language models (LLMs) have shown potential in genomic services, particularly in pretest counseling, but their application in returning positive population-wide genomic screening results remains underexplored. Leveraging advanced LLMs like GPT-4 offers an opportunity to address this gap by delivering accurate, contextual, and user-centered communication to individuals receiving positive genetic test results. This project aimed to design, implement, and evaluate a chatbot integrated with GPT-4, tailored to support the return of positive genomic screening results in the context of South Carolina's In Our DNA SC program. This initiative offers free genetic screening to 100,000 individuals, with over 33,000 results returned and numerous positive findings for conditions such as Lynch syndrome, hereditary breast and ovarian cancer syndrome, and familial hypercholesterolemia. A 3-step prompt engineering process using retrieval-augmented generation and few-shot techniques was used to create the chatbot. Training materials included patient frequently asked questions, genetic counseling scripts, and patient-derived queries. The chatbot underwent iterative refinement based on 13 training questions, while performance was evaluated through expert ratings on responses to 2 hypothetical patient scenarios. The 2 scenarios were intended to represent common but distinct patient profiles in terms of gender, race, ethnicity, age, and background knowledge. Domain experts rated the chatbot using a 5-point Likert scale across 8 predefined criteria: tone, clarity, program accuracy, domain accuracy, robustness, efficiency, boundaries, and usability. The chatbot achieved an average score of 3.86 (SD 0.89) across all evaluation metrics. The highest-rated criteria were tone (mean 4.25, SD 0.71) and usability (mean 4.25, SD 0.58), reflecting the chatbot's ability to communicate effectively and provide a seamless user experience. Boundary management (mean 4.0, SD 0.76) and efficiency (mean 3.88, SD 1.08) also scored well, while clarity and robustness received ratings of 3.81 (SD 1.05) and 3.81 (SD 0.66), respectively. Domain accuracy was rated 3.63 (SD 0.96), indicating satisfactory performance in delivering genetic information, whereas program accuracy received the lowest score of 3.25 (SD 1.39), highlighting the need for improvements in delivering program-specific details. This project demonstrates the feasibility of using LLM-powered chatbots to support the return of positive genomic screening results. The chatbot effectively handled open-ended patient queries, maintained conversational boundaries, and delivered user-friendly responses. However, enhancements in program-specific accuracy are essential to maximize its utility. Future research will explore hybrid chatbot designs that combine the strengths of LLMs with rule-based components to improve scalability, accuracy, and accessibility in genomic service delivery. The findings underscore the potential of generative artificial intelligence tools to address resource limitations and improve the accessibility of genomic health care services.
对全人群基因组筛查的需求不断增加,而遗传咨询资源有限,这迫切需要创新的服务提供模式。由大语言模型(LLMs)驱动的聊天机器人在基因组服务中已显示出潜力,特别是在检测前咨询方面,但其在反馈全人群基因组筛查阳性结果中的应用仍未得到充分探索。利用像GPT-4这样的先进大语言模型,有机会通过向收到阳性基因检测结果的个体提供准确、上下文相关且以用户为中心的沟通来弥补这一差距。本项目旨在设计、实施和评估一个与GPT-4集成的聊天机器人,该聊天机器人专为在南卡罗来纳州的“我们的DNA SC”项目背景下支持阳性基因组筛查结果的反馈而量身定制。该倡议为10万人提供免费基因筛查,已返回超过3.3万个结果,并发现了许多诸如林奇综合征、遗传性乳腺癌和卵巢癌综合征以及家族性高胆固醇血症等疾病的阳性结果。使用检索增强生成和少样本技术的三步提示工程过程来创建聊天机器人。培训材料包括患者常见问题、遗传咨询脚本和患者提出的问题。聊天机器人根据13个训练问题进行了迭代优化,同时通过专家对2个假设患者场景的回复评分来评估性能。这2个场景旨在代表在性别、种族、民族、年龄和背景知识方面常见但不同的患者概况。领域专家使用5点李克特量表对聊天机器人在8个预定义标准上进行评分:语气、清晰度、程序准确性、领域准确性、稳健性、效率、边界和可用性。聊天机器人在所有评估指标上的平均得分为3.86(标准差0.89)。评分最高的标准是语气(平均4.25,标准差0.71)和可用性(平均4.25,标准差0.58),反映了聊天机器人有效沟通并提供无缝用户体验的能力。边界管理(平均4.0,标准差0.76)和效率(平均3.88,标准差1.08)也得分较高,而清晰度和稳健性的评分分别为3.81(标准差1.05)和3.81(标准差0.66)。领域准确性评分为3.63(标准差0.96),表明在提供遗传信息方面表现令人满意,而程序准确性得分最低,为3.25(标准差1.39),突出了在提供特定程序细节方面需要改进。本项目证明了使用由大语言模型驱动的聊天机器人来支持阳性基因组筛查结果反馈的可行性。该聊天机器人有效地处理了开放式患者问题,保持了对话边界,并提供了用户友好的回复。然而,提高特定程序的准确性对于最大化其效用至关重要。未来的研究将探索混合聊天机器人设计,将大语言模型的优势与基于规则的组件相结合,以提高基因组服务提供的可扩展性准确性和可及性。这些发现强调了生成式人工智能工具在解决资源限制和提高基因组医疗服务可及性方面的潜力。