


Assessing the Capability of Large Language Model Chatbots in Generating Plain Language Summaries.

Authors

Mondal Himel, Gupta Gaurav, Sarangi Pradosh Kumar, Sharma Shreya, Choudhary Pritam K, Juhi Ayesha, Kumari Anita, Mondal Shaikat

Affiliations

Physiology, All India Institute of Medical Sciences, Deoghar, IND.

Pediatrics, All India Institute of Medical Sciences, Guwahati, IND.

Publication

Cureus. 2025 Mar 21;17(3):e80976. doi: 10.7759/cureus.80976. eCollection 2025 Mar.

DOI: 10.7759/cureus.80976
PMID: 40260353
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12010112/
Abstract

Background: Plain language summaries (PLSs) make scientific research accessible to a broad non-expert audience. However, crafting an effective PLS can be challenging, particularly for non-native English-speaking researchers. Large language model (LLM) chatbots have the potential to assist in generating summaries, but their effectiveness compared with human-written PLSs remains underexplored.

Methods: This cross-sectional study compared 30 human-written PLSs with PLSs generated by six LLM chatbots: ChatGPT (OpenAI, San Francisco, CA), Claude (Anthropic, San Francisco, CA), Copilot (Microsoft Corp., Washington, DC), Gemini (Google, Mountain View, CA), Meta AI (Meta, Menlo Park, CA), and Perplexity (Perplexity AI, Inc., San Francisco, CA). Readability was assessed with the Flesch Reading Ease (FRE) score and understandability with the Flesch-Kincaid (FK) grade level. Three authors rated each text against seven predefined criteria, and their average score was used to compare PLS quality.

Results: Compared with human-written PLSs, the chatbots generated PLSs with lower FK grade levels (p < 0.0001), and all except Copilot achieved higher FRE scores. The overall score of human-written PLSs was 8.89 ± 0.26. Although the scores varied significantly overall (F = 7.16, p = 0.0012), the post-hoc test found no difference between human-written PLSs and those of any individual chatbot (ChatGPT 8.8 ± 0.34, Claude 8.89 ± 0.33, Copilot 8.69 ± 0.4, Gemini 8.56 ± 0.56, Meta AI 8.98 ± 0.23, and Perplexity 8.8 ± 0.3).

Conclusion: LLM chatbots can generate PLSs that are more readable and understandable to readers with less formal education, and of quality comparable to human-written PLSs. Authors can therefore use LLM chatbots to generate PLSs, which is particularly beneficial for researchers in developing countries. However, because LLM chatbots may introduce minor inaccuracies, LLM-generated PLSs should always be checked for accuracy and relevance.
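The two metrics used in the study are the standard Flesch formulas, which depend only on average sentence length and average syllables per word. A minimal Python sketch is shown below; the vowel-group syllable counter is a rough approximation (real readability tools use exception lists and dictionaries), so treat the output as illustrative:

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid grade level)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    wps = n_words / sentences        # average words per sentence
    spw = n_syllables / n_words      # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fk = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fk
```

Higher FRE means easier text, while FK approximates the US school grade needed to understand it, which is why the chatbots' lower FK grade levels indicate more accessible summaries.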


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/21a4/12010112/763f19eedbe9/cureus-0017-00000080976-i01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/21a4/12010112/93c4d9cb614d/cureus-0017-00000080976-i02.jpg

Similar Articles

1
Assessing the Capability of Large Language Model Chatbots in Generating Plain Language Summaries.
Cureus. 2025 Mar 21;17(3):e80976. doi: 10.7759/cureus.80976. eCollection 2025 Mar.
2
Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware.
Cureus. 2024 Aug 28;16(8):e67996. doi: 10.7759/cureus.67996. eCollection 2024 Aug.
3
Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots.
J Midlife Health. 2025 Jan-Mar;16(1):45-50. doi: 10.4103/jmh.jmh_182_24. Epub 2025 Apr 5.
4
Assessing the quality and readability of patient education materials on chemotherapy cardiotoxicity from artificial intelligence chatbots: An observational cross-sectional study.
Medicine (Baltimore). 2025 Apr 11;104(15):e42135. doi: 10.1097/MD.0000000000042135.
5
Assessing the Readability of Patient Education Materials on Cardiac Catheterization From Artificial Intelligence Chatbots: An Observational Cross-Sectional Study.
Cureus. 2024 Jul 4;16(7):e63865. doi: 10.7759/cureus.63865. eCollection 2024 Jul.
6
Assessing the Quality of Patient Education Materials on Cardiac Catheterization From Artificial Intelligence Chatbots: An Observational Cross-Sectional Study.
Cureus. 2024 Sep 23;16(9):e69996. doi: 10.7759/cureus.69996. eCollection 2024 Sep.
7
Assessment of readability, reliability, and quality of ChatGPT®, BARD®, Gemini®, Copilot®, Perplexity® responses on palliative care.
Medicine (Baltimore). 2024 Aug 16;103(33):e39305. doi: 10.1097/MD.0000000000039305.
8
Evaluating the Quality and Readability of Generative Artificial Intelligence (AI) Chatbot Responses in the Management of Achilles Tendon Rupture.
Cureus. 2025 Jan 31;17(1):e78313. doi: 10.7759/cureus.78313. eCollection 2025 Jan.
9
Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study.
J Med Internet Res. 2024 Nov 4;26:e60291. doi: 10.2196/60291.
10
Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment.
Int J Med Inform. 2025 Sep;201:105948. doi: 10.1016/j.ijmedinf.2025.105948. Epub 2025 Apr 25.

Cited By

1
Assessing Information Provided by ChatGPT: Heart Failure Versus Patent Ductus Arteriosus.
Cureus. 2025 Jun 19;17(6):e86365. doi: 10.7759/cureus.86365. eCollection 2025 Jun.

References

1
The Worst-Case Scenario After AI Use in Academic Writing: A Clever User Wins?
Aust N Z J Obstet Gynaecol. 2024 Dec 19. doi: 10.1111/ajo.13928.
2
Responsible Use of Generative Artificial Intelligence for Research and Writing: Summarizing ICMJE Guideline.
Indian J Orthop. 2024 Aug 28;58(10):1504-1505. doi: 10.1007/s43465-024-01258-5. eCollection 2024 Oct.
3
Practices and Barriers in Developing and Disseminating Plain-Language Resources Reporting Medical Research Information: A Scoping Review.
Patient. 2024 Sep;17(5):493-518. doi: 10.1007/s40271-024-00700-y. Epub 2024 Jun 15.
4
Adapted large language models can outperform medical experts in clinical text summarization.
Nat Med. 2024 Apr;30(4):1134-1142. doi: 10.1038/s41591-024-02855-5. Epub 2024 Feb 27.
5
Assessment of Quality and Readability of Information Provided by ChatGPT in Relation to Anterior Cruciate Ligament Injury.
J Pers Med. 2024 Jan 18;14(1):104. doi: 10.3390/jpm14010104.
6
Can Artificial Intelligence Improve the Readability of Patient Education Materials on Aortic Stenosis? A Pilot Study.
Cardiol Ther. 2024 Mar;13(1):137-147. doi: 10.1007/s40119-023-00347-0. Epub 2024 Jan 9.
7
ChatGPT in academic writing: Maximizing its benefits and minimizing the risks.
Indian J Ophthalmol. 2023 Dec 1;71(12):3600-3606. doi: 10.4103/IJO.IJO_718_23. Epub 2023 Nov 20.
8
ChatGPT Surpasses 1000 Publications on PubMed: Envisioning the Road Ahead.
Cureus. 2023 Sep 6;15(9):e44769. doi: 10.7759/cureus.44769. eCollection 2023 Sep.
9
Artificial Intelligence is Irreversibly Bound to Academic Publishing - ChatGPT is Cleared for Scientific Writing and Peer Review.
Braz J Cardiovasc Surg. 2023 Oct 5;38(4):e20230963. doi: 10.21470/1678-9741-2023-0963.
10
Clinical Research With Large Language Models Generated Writing-Clinical Research with AI-assisted Writing (CRAW) Study.
Crit Care Explor. 2023 Oct 2;5(10):e0975. doi: 10.1097/CCE.0000000000000975. eCollection 2023 Oct.