Popkov Andrey A, Barrett Tyson S
Highmark Health, Pittsburgh, PA, USA.
Contigo Health, LLC, a subsidiary of Premier, Inc, Charlotte, NC, USA.
Account Res. 2024 Mar 22:1-17. doi: 10.1080/08989621.2024.2331757.
Artificial Intelligence (AI) language models continue to expand in both access and capability. As these models have evolved, the number of academic journals in medicine and healthcare which have explored policies regarding AI-generated text has increased. The implementation of such policies requires accurate AI detection tools. Inaccurate detectors risk unnecessary penalties for human authors and/or may compromise the effective enforcement of guidelines against AI-generated content. Yet, the accuracy of AI text detection tools in identifying human-written versus AI-generated content has been found to vary across published studies. This experimental study used a sample of behavioral health publications and found problematic false positive and false negative rates from both free and paid AI detection tools. The study assessed 100 research articles from 2016-2018 in behavioral health and psychiatry journals and 200 texts produced by AI chatbots (100 by "ChatGPT" and 100 by "Claude"). The free AI detector showed a median of 27.2% for the proportion of academic text identified as AI-generated, while commercial software Originality.AI demonstrated better performance but still had limitations, especially in detecting texts generated by Claude. These error rates raise doubts about relying on AI detectors to enforce strict policies around AI text generation in behavioral health publications.
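The false positive and false negative rates discussed in the abstract can be illustrated with a minimal sketch. It assumes each detector returns a score between 0 and 1 per text and that a text is flagged as AI-generated when its score reaches a cutoff. The function name `error_rates`, the 0.5 threshold, and the sample scores are hypothetical and are not taken from the study, whose scoring procedure is not described in the abstract.

```python
from statistics import median


def error_rates(scores_human, scores_ai, threshold=0.5):
    """Return (false_positive_rate, false_negative_rate).

    scores_human: detector scores for texts known to be human-written.
    scores_ai:    detector scores for texts known to be AI-generated.
    threshold:    assumed cutoff; a score at or above it counts as "flagged as AI".
    """
    # Human-written texts wrongly flagged as AI-generated.
    false_positives = sum(s >= threshold for s in scores_human)
    # AI-generated texts that the detector failed to flag.
    false_negatives = sum(s < threshold for s in scores_ai)
    return (false_positives / len(scores_human),
            false_negatives / len(scores_ai))


# Made-up scores for illustration only (not data from the study).
human_scores = [0.10, 0.30, 0.60, 0.20]
ai_scores = [0.90, 0.40, 0.80, 0.70]

fpr, fnr = error_rates(human_scores, ai_scores)

# Median detector score across the human-written texts, analogous in spirit
# to a median "proportion identified as AI-generated".
median_human = median(human_scores)
```

Reporting both rates separately matters here because the two errors have different costs: a false positive penalizes a human author, while a false negative lets AI-generated text pass undetected.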