
Identification of Online Health Information Using Large Pretrained Language Models: Mixed Methods Study.

Authors

Tan Dongmei, Huang Yi, Liu Ming, Li Ziyu, Wu Xiaoqian, Huang Cheng

Affiliations

College of Medical Informatics, Chongqing Medical University, Chongqing, China.

Human Resources Department, Army Medical Center, Army Medical University (The Third Military Medical University), Chongqing, China.

Publication

J Med Internet Res. 2025 May 14;27:e70733. doi: 10.2196/70733.

DOI: 10.2196/70733
PMID: 40367512
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12120363/
Abstract

BACKGROUND

Online health information is widely available, but a substantial portion of it is inaccurate or misleading, including exaggerated, incomplete, or unverified claims. Such misinformation can significantly influence public health decisions and pose serious challenges to health care systems. With advances in artificial intelligence and natural language processing, pretrained large language models (LLMs) have shown promise in identifying and distinguishing misleading health information, although their effectiveness in this area remains underexplored.

OBJECTIVE

This study aimed to evaluate the performance of 4 mainstream LLMs (ChatGPT-3.5, ChatGPT-4, Ernie Bot, and iFLYTEK Spark) in the identification of online health information, providing empirical evidence for their practical application in this field.

METHODS

Web scraping was used to collect data from rumor-refuting websites, resulting in 2708 samples of online health information, including both true and false claims. The 4 LLMs' application programming interfaces were used for authenticity verification, with expert results as benchmarks. Model performance was evaluated using semantic similarity, accuracy, recall, F-score, content analysis, and credibility.
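To make the evaluation concrete, the sketch below shows how accuracy, recall, and F-score could be computed for one model's true/false verdicts against the expert benchmark. This is an illustrative reconstruction, not the authors' code; the labels and the five-claim toy sample are invented.

```python
# Hypothetical sketch: scoring an LLM's true/false verdicts against expert
# labels, as in the study's accuracy / recall / F-score evaluation.

def classification_metrics(expert, model, positive="false"):
    """Compute accuracy, recall, and F1, treating `positive` (here the
    'false claim' verdict) as the class of interest."""
    assert len(expert) == len(model)
    tp = sum(1 for e, m in zip(expert, model) if e == positive and m == positive)
    fp = sum(1 for e, m in zip(expert, model) if e != positive and m == positive)
    fn = sum(1 for e, m in zip(expert, model) if e == positive and m != positive)
    correct = sum(1 for e, m in zip(expert, model) if e == m)

    accuracy = correct / len(expert)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return accuracy, recall, f1

# Toy benchmark: expert verdicts vs. one model's verdicts for five claims.
expert = ["false", "true", "false", "false", "true"]
model  = ["false", "true", "true",  "false", "true"]
acc, rec, f1 = classification_metrics(expert, model)
print(f"accuracy={acc:.2f} recall={rec:.2f} f1={f1:.2f}")
```

In the study the same comparison would be run once per model (ChatGPT-3.5, ChatGPT-4, Ernie Bot, iFLYTEK Spark) over all 2708 samples.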

RESULTS

This study found that the 4 models performed well in identifying online health information. Among them, ChatGPT-4 achieved the highest accuracy at 87.27%, followed by Ernie Bot at 87.25%, iFLYTEK Spark at 87%, and ChatGPT-3.5 at 81.82%. Furthermore, text length and semantic similarity analysis showed that Ernie Bot had the highest similarity to expert texts, whereas ChatGPT-4 showed good overall consistency in its explanations. In addition, the credibility assessment indicated that ChatGPT-4 provided the most reliable evaluations. Further analysis suggested that the LLMs' highest misjudgment rates occurred in the topics of food and maternal-infant nutrition management, and nutritional science and food controversies. Overall, the research suggests that LLMs have potential in online health information identification; however, their understanding of certain specialized health topics may require further improvement.
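The similarity between model explanations and expert texts can be illustrated with a minimal bag-of-words cosine similarity. This is only a stand-in sketch for the semantic-similarity analysis reported above (the paper does not specify this exact method), and the two example sentences are invented.

```python
# Illustrative sketch (not the authors' method): bag-of-words cosine
# similarity between a model explanation and an expert explanation.
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of lowercased word-count vectors for two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

expert_text = "this claim is false because vitamin c does not cure colds"
model_text = "the claim is false since vitamin c does not cure the common cold"
print(round(cosine_similarity(expert_text, model_text), 2))
```

A production comparison would more likely use sentence embeddings rather than raw word counts, but the scoring idea (pairwise similarity of explanation texts against an expert reference) is the same.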

CONCLUSIONS

The results demonstrate that, while these models show potential in providing assistance, their performance varies significantly in terms of accuracy, semantic understanding, and cultural adaptability. The principal findings highlight the models' ability to generate accessible and context-aware explanations; however, they fall short in areas requiring specialized medical knowledge or updated data, particularly for emerging health issues and context-sensitive scenarios. Significant discrepancies were observed in the models' ability to distinguish scientifically verified knowledge from popular misconceptions and in their stability when processing complex linguistic and cultural contexts. These challenges reveal the importance of refining training methodologies to improve the models' reliability and adaptability. Future research should focus on enhancing the models' capability to manage nuanced health topics and diverse cultural and linguistic nuances, thereby facilitating their broader adoption as reliable tools for online health information identification.

Figures 1-7 (PMC image links):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bde9/12120363/b926ce878fee/jmir_v27i1e70733_fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bde9/12120363/e27c326740d7/jmir_v27i1e70733_fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bde9/12120363/c1b8d47fbccf/jmir_v27i1e70733_fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bde9/12120363/bd434bc76b49/jmir_v27i1e70733_fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bde9/12120363/ad63d842e95d/jmir_v27i1e70733_fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bde9/12120363/161cbf946498/jmir_v27i1e70733_fig6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bde9/12120363/f7e7771cb3b2/jmir_v27i1e70733_fig7.jpg

Similar Articles

1. Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis. J Med Internet Res. 2024 Apr 30;26:e54706. doi: 10.2196/54706.
2. Large language model comparisons between English and Chinese query performance for cardiovascular prevention. Commun Med (Lond). 2025 May 16;5(1):177. doi: 10.1038/s43856-025-00802-0.
3. Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard. J Med Internet Res. 2024 May 17;26:e54758. doi: 10.2196/54758.
4. Comparing the performance of ChatGPT and ERNIE Bot in answering questions regarding liver cancer interventional radiology in Chinese and English contexts: A comparative study. Digit Health. 2025 Jan 23;11:20552076251315511. doi: 10.1177/20552076251315511.
5. Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists' Knowledge on COVID-19's Impacts in Pregnancy: Cross-Sectional Pilot Study. JMIR Form Res. 2025 Feb 5;9:e56126. doi: 10.2196/56126.
6. Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation. JMIR Med Inform. 2024 Jun 28;12:e57674. doi: 10.2196/57674.
7. Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis. J Med Internet Res. 2024 Dec 27;26:e66114. doi: 10.2196/66114.
8. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values. JMIR Ment Health. 2024 Apr 9;11:e55988. doi: 10.2196/55988.
9. Examining the Role of Large Language Models in Orthopedics: Systematic Review. J Med Internet Res. 2024 Nov 15;26:e59607. doi: 10.2196/59607.
