Department of Orthopedic Surgery, NYU Langone Health, New York, New York, USA.
Int J Med Robot. 2024 Feb;20(1):e2621. doi: 10.1002/rcs.2621.
Large language models (LLMs) have unknown implications for medical research. This study assessed whether LLM-generated abstracts are distinguishable from human-written abstracts and compared their perceived quality.
The LLM ChatGPT was used to generate 20 arthroplasty abstracts (AI-generated) based on full-text manuscripts, which were compared to the originally published abstracts (human-written). Six blinded orthopedic surgeons rated abstracts on overall quality, communication, and confidence in the authorship source. Authorship-confidence scores were compared to a test value representing complete inability to discern authorship.
Modestly increased confidence in human authorship was observed for human-written abstracts compared with AI-generated abstracts (p = 0.028), though AI-generated abstract authorship-confidence scores were statistically consistent with inability to discern authorship (p = 0.999). Overall abstract quality was higher for human-written abstracts (p = 0.019).
Absolute authorship-confidence ratings indicated that reviewers had difficulty discerning the authorship of AI-generated abstracts, yet these abstracts did not achieve the perceived quality of human-written abstracts. Caution is warranted in implementing LLMs into scientific writing.