Cole Eye Institute, Cleveland Clinic Foundation, Cleveland, Ohio.
JAMA Ophthalmol. 2023 Sep 1;141(9):819-824. doi: 10.1001/jamaophthalmol.2023.3119.
Language-learning model-based artificial intelligence (AI) chatbots are growing in popularity and have significant implications for both patient education and academia. Drawbacks of using AI chatbots to generate scientific abstracts and reference lists, including inaccurate content arising from hallucinations (ie, AI-generated output that deviates from its training data), have not been fully explored.
To evaluate and compare the quality of ophthalmic scientific abstracts and references generated by earlier and updated versions of a popular AI chatbot.
DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional comparative study used 2 versions of an AI chatbot to generate scientific abstracts and 10 references for clinical research questions across 7 ophthalmology subspecialties. The abstracts were graded by 2 authors using modified DISCERN criteria and performance evaluation scores.
Scores for the chatbot-generated abstracts were compared using the t test. Abstracts were also evaluated by 2 AI output detectors. A hallucination rate for unverifiable references generated by the earlier and updated versions of the chatbot was calculated and compared.
The mean modified AI-DISCERN scores for the chatbot-generated abstracts were 35.9 and 38.1 (maximum of 50) for the earlier and updated versions, respectively (P = .30). Using the 2 AI output detectors, the mean fake scores (with a score of 100% meaning generated by AI) for the earlier and updated chatbot-generated abstracts were 65.4% and 10.8%, respectively (P = .01), for one detector and were 69.5% and 42.7% (P = .17) for the second detector. The mean hallucination rates for nonverifiable references generated by the earlier and updated versions were 33% and 29% (P = .74).
Both versions of the chatbot generated average-quality abstracts. There was a high hallucination rate in the generation of fake references, and caution is warranted when relying on these AI resources for health education or academic purposes.