Multicriteria Assessment of Text Quality in Large Language Model-Generated Gynecomastia Materials: DeepSeek Versus OpenAI Versus Claude.

Author Information

Zang Tianying, Li Jiaojiao, Wei Lisha, Wang Yijin

Affiliations

Department of Aesthetic Plastic Surgery and Laser Medicine, Beijing Anzhen Hospital Affiliated to Capital Medical University.

Department of Breast Plastic Surgery, Plastic Surgery Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shijingshan, Beijing, China.

Publication Information

J Craniofac Surg. 2025 Sep 10. doi: 10.1097/SCS.0000000000011930.

Abstract

BACKGROUND

With the development of artificial intelligence, obtaining patient-centered medical information through large language models (LLMs) is crucial for patient education. However, existing digital resources in online health care have heterogeneous quality, and the reliability and readability of content generated by various AI models need to be evaluated to meet the needs of patients with different levels of cultural literacy.

OBJECTIVE

This study aims to compare the accuracy and readability of different LLMs in providing medical information related to gynecomastia, and to identify the most promising patient-education tool for practical clinical application.

METHODS

This study selected the 10 most frequently searched questions about gynecomastia from PubMed and Google Trends. Responses were generated using 3 LLMs (DeepSeek-R1, OpenAI-O3, Claude-4-Sonnet), with text quality assessed using the DISCERN-AI and PEMAT-AI scales. Text readability and legibility were evaluated comprehensively through metrics including word count, syllable count, Flesch-Kincaid Grade Level (FKGL), Flesch-Kincaid Reading Ease (FKRE), the SMOG index, and the Automated Readability Index (ARI).
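The four readability formulas named above are standard and easy to compute by hand. The sketch below implements them from scratch; the syllable counter is a crude vowel-group heuristic (published studies typically use validated tools for syllable counts), so treat its output as an approximation rather than a reproduction of the study's pipeline:

```python
import re
from math import sqrt

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per run of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_s, n_w = len(sentences), len(words)
    syllables = sum(count_syllables(w) for w in words)
    chars = sum(len(w) for w in words)          # letters only, per ARI convention
    poly = sum(1 for w in words if count_syllables(w) >= 3)
    return {
        # Flesch-Kincaid Grade Level: US school grade needed to read the text.
        "FKGL": 0.39 * n_w / n_s + 11.8 * syllables / n_w - 15.59,
        # Flesch-Kincaid Reading Ease: higher = easier (90+ is very easy).
        "FKRE": 206.835 - 1.015 * n_w / n_s - 84.6 * syllables / n_w,
        # SMOG index: grade estimate from polysyllabic word density.
        "SMOG": 1.0430 * sqrt(poly * 30 / n_s) + 3.1291,
        # Automated Readability Index: grade estimate from character counts.
        "ARI": 4.71 * chars / n_w + 0.5 * n_w / n_s - 21.43,
    }

scores = readability("The cat sat on the mat. The dog ran to the park.")
```

Short, monosyllabic sentences like the sample score near or below grade 0 on FKGL and ARI, which is why LLM answers averaging FKGL 13-14 sit well above the commonly recommended sixth-to-eighth-grade level for patient materials.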

RESULTS

In terms of quality evaluation, among the 10 items of the DISCERN-AI scale, only the overall content quality score showed a statistically significant difference (P = 0.001), with DeepSeek-R1 performing best at a median score of 5 (5, 5). Regarding readability, DeepSeek-R1 exhibited the highest average word count and syllable count (both P < 0.001). The 3 models showed no significant differences in FKGL, FKRE, or ARI. Specifically, the mean FKGL score was 14.08 for DeepSeek-R1, 14.1 for OpenAI-O3, and 13.31 for Claude-4-Sonnet. The SMOG evaluation indicated that Claude-4-Sonnet had the strongest readability, with a mean score of 11 (P = 0.028).

CONCLUSION

DeepSeek-R1 demonstrated the highest overall quality in content generation, followed by Claude-4-Sonnet. Evaluations using the FKGL, SMOG index, and ARI all indicated that Claude-4-Sonnet exhibited the best readability. Given that improvements in quality and readability can enhance patient engagement and reduce anxiety, these 2 models should be prioritized for patient education applications. Future efforts should focus on integrating these advantages to develop more reliable large-scale medical language models.
