Comparing the Efficacy of Large Language Models ChatGPT, BARD, and Bing AI in Providing Information on Rhinoplasty: An Observational Study.

Author information

Seth Ishith, Lim Bryan, Xie Yi, Cevik Jevan, Rozen Warren M, Ross Richard J, Lee Mathew

Publication information

Aesthet Surg J Open Forum. 2023 Sep 14;5:ojad084. doi: 10.1093/asjof/ojad084. eCollection 2023.

Abstract

BACKGROUND

Large language models (LLMs) are emerging artificial intelligence (AI) technologies refining research and healthcare. However, the impact of these models on presurgical planning and education remains under-explored.

OBJECTIVES

This study aims to assess 3 prominent LLMs, Google's AI BARD (Mountain View, CA), Bing AI (Microsoft, Redmond, WA), and ChatGPT-3.5 (OpenAI, San Francisco, CA), in providing safe medical information for rhinoplasty.

METHODS

Six questions regarding rhinoplasty were posed to ChatGPT, BARD, and Bing AI. A panel of Specialist Plastic and Reconstructive Surgeons with extensive experience in rhinoplasty evaluated the responses using a Likert scale. To measure readability, the Flesch Reading Ease Score, the Flesch-Kincaid Grade Level, and the Coleman-Liau Index were used. The modified DISCERN score was chosen as the criterion for assessing suitability and reliability. A t test was performed to calculate the difference between the LLMs, and a 2-sided P value < .05 was considered statistically significant.
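The three indices named in the methods are standard readability formulas, and a minimal Python sketch may help show what each one measures. The formula coefficients are the published ones; the `count_text` helper, with its regex tokenizer and vowel-group syllable counter, is a rough assumption for illustration and is not the tool the authors used.

```python
import re


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def coleman_liau_index(letters, words, sentences):
    """Coleman-Liau Index: grade level from letters and sentences per 100 words."""
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def count_text(text):
    """Naive counters (assumption): real tools use better tokenizers
    and syllable dictionaries, so scores will differ somewhat."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    letters = sum(len(w) for w in words)
    # Approximate syllables as runs of vowels, at least one per word.
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return {
        "words": len(words),
        "sentences": sentences,
        "letters": letters,
        "syllables": syllables,
    }


if __name__ == "__main__":
    sample = "Rhinoplasty reshapes the nose. Recovery usually takes several weeks."
    c = count_text(sample)
    print(flesch_reading_ease(c["words"], c["sentences"], c["syllables"]))
    print(flesch_kincaid_grade(c["words"], c["sentences"], c["syllables"]))
    print(coleman_liau_index(c["letters"], c["words"], c["sentences"]))
```

Note the direction of each scale: a higher Flesch Reading Ease Score indicates easier text, whereas higher Flesch-Kincaid and Coleman-Liau values indicate text that requires more years of schooling.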

RESULTS

In terms of readability, BARD and ChatGPT demonstrated a significantly (P < .05) greater Flesch Reading Ease Score of 47.47 (±15.32) and 37.68 (±12.96), Flesch-Kincaid Grade Level of 9.7 (±3.12) and 10.15 (±1.84), and Coleman-Liau Index of 10.83 (±2.14) and 12.17 (±1.17), respectively, than Bing AI. In terms of suitability, BARD (46.3 ± 2.8) demonstrated a significantly greater DISCERN score than ChatGPT and Bing AI. In terms of Likert score, ChatGPT and BARD demonstrated similar scores, both greater than that of Bing AI.
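As an illustration of how a comparison between two models could be computed from the reported mean (±SD) values, the sketch below calculates a Welch t statistic and its approximate degrees of freedom from summary statistics. Two assumptions are made here that the abstract does not confirm: that each model contributed n = 6 scores (one per question), and that an unequal-variance (Welch) test is appropriate. The function name `welch_t_from_summary` is hypothetical.

```python
import math


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    computed from group means, standard deviations, and sample sizes."""
    var1 = sd1 ** 2 / n1
    var2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(var1 + var2)
    df = (var1 + var2) ** 2 / (var1 ** 2 / (n1 - 1) + var2 ** 2 / (n2 - 1))
    return t, df


if __name__ == "__main__":
    # Flesch Reading Ease Scores reported for BARD and ChatGPT;
    # n = 6 per model is an assumption based on the six questions.
    t, df = welch_t_from_summary(47.47, 15.32, 6, 37.68, 12.96, 6)
    print(f"t = {t:.3f}, df = {df:.2f}")
```

The statistic would then be compared against a t distribution with the computed degrees of freedom to obtain a 2-sided P value. Bing AI's values are not given in the abstract, so the comparisons against Bing AI cannot be reproduced from this text alone.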

CONCLUSIONS

BARD delivered the most succinct and comprehensible information, followed by ChatGPT and Bing AI. Although these models demonstrate potential, challenges regarding their depth and specificity remain. Therefore, future research should aim to augment LLM performance through the integration of specialized databases and expert knowledge, while also refining their algorithms.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ccd/10547367/79ddf0c3935e/ojad084f1.jpg
