Tianjin University of Traditional Chinese Medicine, Tianjin, China.
Dongguan Rehabilitation Experimental School, Dongguan, China.
J Med Internet Res. 2024 Apr 30;26:e54706. doi: 10.2196/54706.
BACKGROUND: Feasibility assessments of using large language models (LLMs) to respond to inquiries from autistic patients in a Chinese-language context are scarce. Although Chinese is one of the most widely spoken languages globally, research on applying these models in medicine has focused predominantly on English-speaking populations. OBJECTIVE: This study aims to assess the effectiveness of LLM chatbots, specifically ChatGPT-4 (OpenAI) and ERNIE Bot (version 2.2.3; Baidu, Inc), one of the most advanced LLMs in China, in addressing inquiries from autistic individuals in a Chinese setting. METHODS: We gathered data from DXY, a widely recognized, web-based medical consultation platform in China with a user base of over 100 million individuals. A total of 100 patient consultation samples, comprising 239 questions, were rigorously selected from publicly available autism-related records posted on the platform between January 2018 and August 2023. To maintain objectivity, both the original questions and the responses were anonymized and randomized. An evaluation team of 3 chief physicians rated the responses across 4 dimensions: relevance, accuracy, usefulness, and empathy, completing 717 evaluations in total. For each question, the team first identified the best response and then rated each response on a 5-point Likert scale, with each point representing a distinct level of quality. Finally, we compared the responses collected from the different sources. RESULTS: Across the 717 evaluations, assessors preferred the physicians' response in 46.86% (95% CI 43.21%-50.51%) of cases, ChatGPT's in 34.87% (95% CI 31.38%-38.36%), and ERNIE Bot's in 18.27% (95% CI 15.44%-21.10%). The mean relevance scores for physicians, ChatGPT, and ERNIE Bot were 3.75 (95% CI 3.69-3.82), 3.69 (95% CI 3.63-3.74), and 3.41 (95% CI 3.35-3.46), respectively.
Physicians (3.66, 95% CI 3.60-3.73) and ChatGPT (3.73, 95% CI 3.69-3.77) received higher accuracy ratings than ERNIE Bot (3.52, 95% CI 3.47-3.57). For usefulness, physicians (3.54, 95% CI 3.47-3.62) scored higher than ChatGPT (3.40, 95% CI 3.34-3.47) and ERNIE Bot (3.05, 95% CI 2.99-3.12). Finally, on the empathy dimension, ChatGPT (3.64, 95% CI 3.57-3.71) outperformed both physicians (3.13, 95% CI 3.04-3.21) and ERNIE Bot (3.11, 95% CI 3.04-3.18). CONCLUSIONS: In this cross-sectional study, physicians' responses were rated highest overall in the Chinese-language context examined. Nonetheless, LLMs can provide valuable medical guidance to autistic patients and may even surpass physicians in demonstrating empathy. Further optimization and research are needed, however, before LLMs can be effectively integrated into clinical settings across diverse linguistic environments. TRIAL REGISTRATION: Chinese Clinical Trial Registry ChiCTR2300074655; https://www.chictr.org.cn/bin/project/edit?pid=199432.
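The reported preference percentages and their 95% CIs are consistent with a normal-approximation (Wald) interval for a proportion over the 717 evaluations. The sketch below reproduces the abstract's figures under that assumption; the raw counts (336 for physicians, 250 for ChatGPT, 131 for ERNIE Bot) are not stated in the abstract and are inferred here from the percentages.

```python
import math

def wald_ci(count, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion, in percent."""
    p = count / n
    se = math.sqrt(p * (1 - p) / n)
    return (round(100 * p, 2),
            round(100 * (p - z * se), 2),
            round(100 * (p + z * se), 2))

# Counts inferred from the reported percentages of 717 evaluations (assumed):
# physicians 336, ChatGPT 250, ERNIE Bot 131.
for name, count in [("physicians", 336), ("ChatGPT", 250), ("ERNIE Bot", 131)]:
    pct, lo, hi = wald_ci(count, 717)
    print(f"{name}: {pct}% (95% CI {lo}%-{hi}%)")
```

Running this yields 46.86% (43.21%-50.51%), 34.87% (31.38%-38.36%), and 18.27% (15.44%-21.10%), matching the abstract; a Wilson interval would give slightly different bounds at these sample sizes.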