Coen Emma, Del Fiol Guilherme, Kaphingst Kimberly A, Borsato Emerson, Shannon Jackie, Smith Hadley Stevens, Masino Aaron, Allen Caitlin G
Clemson University.
University of Utah.
Res Sq. 2024 Aug 29:rs.3.rs-4986527. doi: 10.21203/rs.3.rs-4986527/v1.
The growing demand for genomic testing and limited access to experts necessitate innovative service models. While chatbots have shown promise in supporting genomic services such as pre-test counseling, their use in returning positive genetic results, especially with more recent large language models (LLMs), remains unexplored.
This study reports the prompt engineering process and intrinsic evaluation of the LLM component of a chatbot designed to support returning positive population-wide genomic screening results.
We used a three-step prompt engineering process, including Retrieval-Augmented Generation (RAG) and few-shot prompting techniques, to develop an open-response chatbot. The chatbot was then evaluated using two hypothetical scenarios, with experts rating its performance on a 5-point Likert scale across eight criteria: tone, clarity, program accuracy, domain accuracy, robustness, efficiency, boundaries, and usability.
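The study does not publish its implementation, but the RAG + few-shot approach it describes can be sketched as follows. Everything here is an illustrative assumption: the corpus passages, the example question-answer pairs, and the word-overlap retriever are hypothetical stand-ins for the program documents and embedding-based retrieval a real system would use.

```python
# Illustrative sketch only (not the paper's code): assemble a prompt that
# combines retrieved program context with few-shot examples before the
# user's question. Corpus, examples, and retriever are placeholders.

CORPUS = [
    "A positive screening result means a variant associated with increased risk was found.",
    "Participants with positive results are offered a follow-up appointment with a genetic counselor.",
    "The chatbot cannot provide a medical diagnosis or replace clinical advice.",
]

FEW_SHOT = [
    ("Does a positive result mean I have cancer?",
     "No. A positive result indicates increased risk, not a diagnosis. "
     "A genetic counselor can explain what it means for you."),
]

def retrieve(question, corpus, k=2):
    """Rank passages by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_words & set(p.lower().split())))
    return scored[:k]

def build_prompt(question):
    """Compose system instructions, retrieved context, and few-shot examples."""
    context = "\n".join(retrieve(question, CORPUS))
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT)
    return (
        "You support returning positive genomic screening results.\n"
        "Answer only from the context; refer clinical questions to a counselor.\n\n"
        f"Context:\n{context}\n\nExamples:\n{shots}\n\nQ: {question}\nA:"
    )

print(build_prompt("What does my positive result mean?"))
```

In a production system the toy retriever would be replaced by embedding similarity over the screening program's documentation, which is what keeps the model's answers grounded in program-specific facts (the "program accuracy" criterion rated in the evaluation).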
The chatbot achieved an overall score of 3.88 out of 5 across all criteria and scenarios. The highest ratings were in tone (4.25), usability (4.25), and boundary management (4.0), followed by efficiency (3.88), clarity and robustness (3.81 each), and domain accuracy (3.63). The lowest-rated criterion was program accuracy, at 3.25.
The LLM handled open-ended queries well and maintained appropriate boundaries, while the lower program accuracy rating indicates room for improvement. Future work will focus on refining prompts, expanding evaluations, and exploring optimal hybrid designs that integrate LLM components with rule-based chatbot components to enhance genomic service delivery.