Lee Sam Yu-Te, Bahukhandi Aryaman, Liu Dongyu, Ma Kwan-Liu
IEEE Trans Vis Comput Graph. 2025 Jan;31(1):481-491. doi: 10.1109/TVCG.2024.3456398. Epub 2024 Nov 25.
Recent advancements in Large Language Models (LLMs) and Prompt Engineering have made chatbot customization more accessible, significantly lowering barriers to tasks that previously required programming skills. However, prompt evaluation, especially at the dataset scale, remains complex: a prompt must be assessed across thousands of test instances within a dataset. Based on a comprehensive literature review and a pilot study, our study summarizes five critical challenges in prompt evaluation. In response, we introduce a feature-oriented workflow for systematic prompt evaluation. In the context of text summarization, our workflow advocates evaluation with summary characteristics (feature metrics) such as complexity, formality, or naturalness, instead of traditional quality metrics like ROUGE. This design choice enables a more user-friendly evaluation of prompts, as it guides users in sorting through the ambiguity inherent in natural language. To support this workflow, we introduce Awesum, a visual analytics system that facilitates identifying optimal prompt refinements for text summarization through interactive visualizations. It features a novel Prompt Comparator that employs a BubbleSet-inspired design enhanced by dimensionality reduction techniques. We evaluated the effectiveness and general applicability of the system with practitioners from various domains and found that (1) our design helps non-technical users overcome the learning curve of systematically evaluating summarization prompts, and (2) our feature-oriented workflow has the potential to generalize to other natural language generation (NLG) and image generation tasks. For future work, we advocate moving toward feature-oriented evaluation of LLM prompts and discuss unsolved challenges in human-agent interaction.
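To make the contrast between feature metrics and reference-based quality metrics concrete, the sketch below scores candidate summaries with crude, illustrative proxies for complexity and formality alongside ROUGE-1 recall. The proxies, function names, and example texts are assumptions for illustration only; they are not the metrics or data used in the Awesum system.

```python
# Illustrative sketch: score candidate summaries by interpretable feature
# metrics (complexity, formality) in addition to a reference-based quality
# metric (ROUGE-1 recall). The feature proxies here are crude stand-ins,
# not the actual metrics from the Awesum paper.
import re

def rouge1_recall(summary: str, reference: str) -> float:
    """Fraction of reference unigrams that appear in the summary (ROUGE-1 recall)."""
    ref_tokens = re.findall(r"\w+", reference.lower())
    cand_tokens = set(re.findall(r"\w+", summary.lower()))
    if not ref_tokens:
        return 0.0
    return sum(1 for tok in ref_tokens if tok in cand_tokens) / len(ref_tokens)

def complexity(summary: str) -> float:
    """Crude complexity proxy: average word length (longer words ~ more complex)."""
    words = re.findall(r"\w+", summary)
    return sum(len(w) for w in words) / max(len(words), 1)

def formality(summary: str) -> float:
    """Crude formality proxy: share of words with Latinate noun suffixes."""
    words = re.findall(r"\w+", summary.lower())
    formal_suffixes = ("tion", "ment", "ity", "ance", "ence")
    return sum(w.endswith(formal_suffixes) for w in words) / max(len(words), 1)

if __name__ == "__main__":
    reference = "The committee approved the proposal after a long discussion."
    candidates = {
        "prompt A": "They said yes to the plan after talking a lot.",
        "prompt B": "The committee granted approval to the proposal following deliberation.",
    }
    for name, s in candidates.items():
        print(f"{name}: ROUGE-1 R={rouge1_recall(s, reference):.2f}, "
              f"complexity={complexity(s):.2f}, formality={formality(s):.2f}")
```

Run on the two hypothetical candidates above, the feature columns separate the casual and formal summaries even when their ROUGE scores are close, which is the intuition behind evaluating prompts by summary characteristics rather than by a single quality number.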