Evaluation of Three Large Language Models' Response Performances to Inquiries Regarding Post-Abortion Care in the Context of Chinese Language: A Comparative Analysis.

Authors

Xue Danyue, Liao Sha

Affiliations

Department of Operating Room Nursing, West China Second University Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China.

Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, Sichuan, People's Republic of China.

Publication information

Risk Manag Healthc Policy. 2025 Aug 18;18:2731-2741. doi: 10.2147/RMHP.S531777. eCollection 2025.

DOI: 10.2147/RMHP.S531777

PMID: 40862290

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12372831/

Abstract

BACKGROUND

This study aimed to evaluate how three large language models (LLMs), ChatGPT, Kimi, and Ernie Bot, respond to Chinese-language inquiries about post-abortion care (PAC).

METHODS

Data were collected in October 2024. Twenty questions were used, covering the necessity of contraception after induced abortion, the best time for contraception, the choice of contraceptive method, contraceptive effectiveness, and the potential impact of contraception on fertility. Each question was posed to each LLM three times in Chinese. Three PAC consultants evaluated the responses, scoring each on a Likert scale for accuracy, relevance, completeness, clarity, and reliability.

RESULTS

In the overall evaluation, 159 (88.30%), 19 (10.57%), and 2 (1.10%) responses were rated "good" (mean score > 4), "average" (3 < mean score ≤ 4), and "poor" (mean score ≤ 3), respectively. No statistically significant difference in the overall evaluation was identified among the three LLMs (P = 0.352). The numbers of responses rated good for accuracy, relevance, completeness, clarity, and reliability were 87 (48.33%), 154 (85.53%), 136 (75.57%), 133 (73.87%), and 128 (71.10%), respectively. No statistically significant differences were identified among the three LLMs in accuracy, relevance, completeness, or clarity, whereas a statistically significant difference was identified in reliability (P < 0.001).
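The rating thresholds reported above can be illustrated with a small sketch. It is not the authors' analysis code. The function name `rate_response`, the sample scores, and the choice to average one score per rater are assumptions made for illustration only, since the abstract does not say how the mean score was aggregated.

```python
from statistics import mean


def rate_response(scores):
    """Map Likert scores for one response to the abstract's categories.

    Assumption: `scores` holds one score per rater and the categories
    are applied to their arithmetic mean.
    """
    m = mean(scores)
    if m > 4:
        return "good"      # mean score > 4
    if m > 3:
        return "average"   # 3 < mean score <= 4
    return "poor"          # mean score <= 3


# Study design from the Methods: 20 questions, each asked 3 times, to 3 LLMs.
total_responses = 20 * 3 * 3

# Hypothetical scores from three raters (not data from the study).
print(rate_response([5, 4, 5]))
print(rate_response([4, 4, 4]))
print(rate_response([3, 3, 3]))
print(total_responses)
```

Under these assumptions, a mean of exactly 4 falls into "average" and a mean of exactly 3 into "poor", matching the boundaries given in the Results. The response count implied by the design matches the sum of the three reported category counts (159 + 19 + 2).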

CONCLUSION

The three LLMs performed well overall and showed great potential for application in PAC consultations. The accuracy of the LLMs' responses should be improved through continuous training and evaluation.
