Department of Surgery, Division of Surgical Oncology, Medical College of Wisconsin, Milwaukee, WI, USA.
Department of Surgery, Division of Trauma, Critical Care, and Acute Care Surgery, Duke University, Durham, NC, USA.
Ann Surg Oncol. 2024 Oct;31(10):6387-6393. doi: 10.1245/s10434-024-15549-6. Epub 2024 Jun 22.
Few studies have examined the performance of artificial intelligence (AI) content detection in scientific writing. This study evaluates the performance of three publicly available AI content detectors when applied to both human-written and AI-generated scientific articles.
Articles published in Annals of Surgical Oncology (ASO) during 2022, as well as AI-generated articles produced with OpenAI's ChatGPT, were analyzed by three AI content detectors to assess the probability of AI-generated content. Full manuscripts and their individual sections were evaluated. Group comparisons and trend analyses were conducted using ANOVA and linear regression. Classification performance was determined using area under the curve (AUC).
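As a hedged illustration (not the authors' actual analysis code), the AUC metric used here for classification performance can be computed from detector scores and ground-truth labels; the labels and scores below are hypothetical values invented for demonstration:

```python
# Illustrative sketch: computing AUC for an AI-content detector.
# Labels: 1 = AI-generated article, 0 = human-written.
# Scores: the detector's reported probability of AI generation.
# All values are hypothetical, not taken from the study.

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation:
    the probability that a randomly chosen positive example
    outscores a randomly chosen negative one, ties counting 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.99, 0.43, 0.12, 0.09, 0.30, 0.02]
print(round(auc(labels, scores), 3))  # one AI article scores below a human one
```

An AUC of 1.0 would mean every AI-generated article outscored every human-written one; the overlap reported in the results (human articles averaging 9.4% and AI articles ranging down to 12.0%) is what pulls a detector's AUC below 1.0.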
A total of 449 original articles met the inclusion criteria and were evaluated to determine the likelihood of being AI-generated. Each detector also evaluated 47 AI-generated articles created from the titles of ASO articles. Human-written articles had an average probability of being AI-generated of 9.4%, with significant differences between the detectors. Only two (0.4%) human-written manuscripts were assigned a 0% probability of being AI-generated by all three detectors. Completely AI-generated articles received a higher average probability of being AI-generated (43.5%), ranging from 12.0% to 99.9%.
This study demonstrates differences in the performance of various AI content detectors, including the potential to label human-written articles as AI-generated. Any effort toward implementing AI detectors must include a strategy for continuous evaluation and validation, as AI models and detectors rapidly evolve.