Assessing the accuracy and explainability of using ChatGPT to evaluate the quality of health news.

Author information

Liu Xiaoyu, He Lu, Alanazi Eman, Liu Echu, Goss Arianna, Gumireddy Lionel

Affiliations

College for Public Health and Social Justice, Saint Louis University, St. Louis, USA.

Zilber College of Public Health, University of Wisconsin-Milwaukee, Milwaukee, USA.

Publication information

BMC Public Health. 2025 Jun 2;25(1):2038. doi: 10.1186/s12889-025-23206-0.

Abstract

BACKGROUND

With the growing prevalence of health misinformation online, there is an urgent need for tools that can reliably assist the public in evaluating the quality of health information. This study investigates the performance of GPT-3.5-Turbo, a representative and widely used large language model (LLM), in rating the quality of health news and providing explanations that justify its ratings.

METHODS

We evaluated GPT-3.5-Turbo’s performance on 3222 health news articles from an expert-annotated dataset compiled by HealthNewsReview.org, which assesses the quality of health news across nine criteria. GPT-3.5-Turbo was prompted with standardized queries tailored to each criterion. We measured its rating performance using 95% confidence intervals for precision, recall, and F1 scores in binary classification (satisfactory/not satisfactory). Additionally, the linguistic complexity, readability, and explanatory quality of GPT-3.5-Turbo’s responses were assessed through both quantitative linguistic analysis and qualitative evaluation of consistency and contextual relevance.
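A minimal Python sketch of this evaluation step is shown below, assuming the OpenAI chat-completions API and a percentile bootstrap for the 95% confidence intervals; the prompt wording, response parsing, and bootstrap settings are illustrative assumptions rather than the authors' exact pipeline.

```python
# Sketch only: prompt wording, label parsing, and bootstrap settings are assumptions.
import numpy as np
from openai import OpenAI
from sklearn.metrics import precision_score, recall_score, f1_score

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def rate_criterion(article_text: str, criterion: str) -> int:
    """Ask GPT-3.5-Turbo for a binary rating of one quality criterion (1 = satisfactory)."""
    prompt = (
        f"Does the following health news article adequately address the "
        f"'{criterion}' criterion? Answer 'satisfactory' or 'not satisfactory'.\n\n"
        f"{article_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.lower()
    return 0 if "not satisfactory" in answer else 1


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% confidence interval for a classification metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        stats.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(stats, [2.5, 97.5])


# Hypothetical usage with expert labels (1 = satisfactory) for one criterion:
# y_pred = [rate_criterion(text, "Cost") for text in articles]
# for m in (precision_score, recall_score, f1_score):
#     print(m.__name__, m(y_true, y_pred), bootstrap_ci(y_true, y_pred, m))
```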

RESULTS

GPT-3.5-Turbo’s rating performance varied across criteria, with the highest accuracy for the Cost criterion (F1 = 0.824) but lower accuracy for the Benefit, Conflict, and Quality criteria (F1 < 0.5), underperforming traditional supervised machine learning models. However, its explanations were clear, with readability suited to late high school or early college levels, and scored highly for consistency (average score: 2.90/3) and contextual relevance (average score: 2.73/3). These findings highlight GPT-3.5-Turbo’s strength in providing understandable and contextually relevant explanations despite its limited rating accuracy.
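The readability finding (late high school to early college level) maps onto standard readability indices; the short sketch below uses the textstat package on a made-up explanation string, and the choice of Flesch-Kincaid grade level and Flesch reading ease is an assumption, since the abstract does not name the exact measures used.

```python
# Illustrative readability check; metric choice is an assumption, not the paper's protocol.
import textstat

# Hypothetical model-generated explanation for one criterion of one article.
explanation = (
    "The article describes the drug's benefits only in relative terms and "
    "does not say what the treatment would cost a typical patient."
)

# A Flesch-Kincaid grade of roughly 10-13 corresponds to late high school
# or early college reading levels.
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(explanation))
print("Flesch reading ease:", textstat.flesch_reading_ease(explanation))
```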

CONCLUSION

While GPT-3.5-Turbo’s rating accuracy requires improvement, its strength in offering comprehensible and contextually relevant explanations presents a valuable opportunity to enhance public understanding of health news quality. Leveraging LLMs as complementary tools for health literacy initiatives could help mitigate misinformation by making it easier for non-expert audiences to interpret and assess health information.

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1186/s12889-025-23206-0.

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9fa/12128262/93412efa21cd/12889_2025_23206_Fig1_HTML.jpg
