Artificial intelligence versus human touch: can artificial intelligence accurately generate a literature review on laser technologies?

Affiliations

Department of Urology, Westmoreland Street Hospital, UCLH NHS Foundation Trust, 16-18 Westmoreland Street, Marylebone, London, W1G 8PH, UK.

Sorbonne University, GRC Urolithiasis No. 20, Tenon Hospital, 75020 Paris, France.

Publication information

World J Urol. 2024 Oct 28;42(1):598. doi: 10.1007/s00345-024-05311-8.

Abstract

PURPOSE

To compare the accuracy of open-source Artificial Intelligence (AI) Large Language Models (LLMs) against human authors in generating a systematic review (SR) on the new pulsed-Thulium:YAG (p-Tm:YAG) laser.

METHODS

Five manuscripts were compared. The Human-SR on p-Tm:YAG (considered the "ground truth") was written by independent certified endourologists with expertise in lasers and had been accepted by a peer-reviewed, PubMed-indexed journal (but was not yet available online and therefore not accessible to the LLMs). The query "write a systematic review on pulsed-Thulium:YAG laser for lithotripsy" was submitted to four LLMs (ChatGPT3.5/Vercel/Claude/Mistral-7b). The LLM-SRs were standardized and the Human-SR reformatted to match the same general output appearance, to ensure blinding. Nine participants with varying levels of endourological expertise (three Clinical Nurse Specialists, three Urology Trainees and three Consultants) objectively assessed the accuracy of the five SRs using a bespoke 10-checkpoint proforma. A subjective assessment was recorded using a composite score comprising quality (0-10), clarity (0-10) and overall manuscript rank (1-5).
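
As an illustration of how the blinded assessment described above could be tallied, here is a minimal Python sketch. It is not the authors' actual instrument: the function names, the rank-to-score mapping and the equal weighting of the composite are assumptions, since the abstract specifies only the three subjective scales and the 10-checkpoint proforma.

from statistics import mean

CHECKPOINTS = 10  # bespoke 10-checkpoint proforma

def objective_accuracy(checkpoints_met: int) -> float:
    # Objective accuracy as the percentage of proforma checkpoints satisfied.
    return 100.0 * checkpoints_met / CHECKPOINTS

def composite_subjective(quality: float, clarity: float, rank: int) -> float:
    # Hypothetical composite: quality and clarity out of 10, overall rank 1 (best)
    # to 5 (worst), rescaled to a single 0-100 score; the study's exact weighting
    # is not stated in the abstract.
    rank_score = (5 - rank) / 4 * 10  # map rank 1..5 onto 10..0
    return 100.0 * (quality + clarity + rank_score) / 30

# Dummy example: one manuscript's objective accuracy averaged over nine reviewers.
reviewer_checkpoints = [9, 10, 9, 8, 10, 9, 9, 10, 9]
print(mean(objective_accuracy(c) for c in reviewer_checkpoints))
print(composite_subjective(quality=8, clarity=7, rank=2))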

RESULTS

The Human-SR was objectively and subjectively more accurate than the LLM-SRs (96 ± 7% and 86.8 ± 8.2%, respectively; p < 0.001). The LLM-SRs did not differ significantly from one another, although ChatGPT3.5 achieved the highest subjective and objective accuracy scores among them (62.4 ± 15% and 29 ± 28%, respectively; p > 0.05). Quality and clarity assessments were significantly affected by SR type but not by assessor expertise level (p < 0.001 and p > 0.05, respectively).
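
The abstract does not name the statistical tests behind these p-values. Purely for illustration, the sketch below shows one generic way accuracy scores could be compared across the five manuscripts (a non-parametric Kruskal-Wallis test on dummy reviewer data); both the data and the choice of test are assumptions, not the study's analysis.

from scipy.stats import kruskal

# Dummy per-reviewer objective accuracy scores (%), purely illustrative.
scores = {
    "Human": [100, 90, 100, 90, 100, 100, 90, 100, 90],
    "ChatGPT3.5": [40, 30, 20, 50, 10, 30, 20, 40, 20],
    "Vercel": [20, 10, 30, 20, 10, 20, 30, 10, 20],
    "Claude": [30, 20, 10, 20, 30, 20, 10, 30, 20],
    "Mistral-7b": [10, 20, 20, 10, 30, 20, 10, 20, 10],
}

# Non-parametric comparison of objective accuracy across the five manuscripts.
stat, p_value = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p_value:.4f}")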

CONCLUSIONS

LLM-generated data on highly technical topics are less accurate than content produced by Key Opinion Leaders. With human supervision, LLMs, and ChatGPT3.5 in particular, could improve our practice.
