Erol Gökberk, Ergen Anıl, Gülşen Erol Büşra, Kaya Ergen Şebnem, Bora Tevfik Serhan, Çölgeçen Ali Deniz, Araz Büşra, Şahin Cansel, Bostancı Günsu, Kılıç İlayda, Macit Zeynep Birce, Sevgi Umut Tan, Güngör Abuzer
Department of Neurosurgery, Adiyaman Training and Research Hospital, Adiyaman, Türkiye.
Department of Neurosurgery, Derince Training and Research Hospital, Kocaeli, Türkiye.
Acta Neurochir (Wien). 2025 Aug 7;167(1):214. doi: 10.1007/s00701-025-06622-4.
OBJECTIVE: This study evaluates the reliability and accuracy of AI-generated text detection tools in distinguishing human-authored academic content from AI-generated text, and highlights the challenges and ethical considerations of applying them within the scientific community.
METHODS: The study analyzed the detectability of AI-generated academic content using abstracts and introductions created by ChatGPT versions 3.5, 4, and 4o, alongside human-written originals from the pre-ChatGPT era. Articles were sourced from four high-impact neurosurgery journals and grouped into four categories: originals, and texts generated by ChatGPT 3.5, ChatGPT 4, and ChatGPT 4o. Three AI-output detectors (GPTZero, ZeroGPT, Corrector App) were used to classify 1,000 texts as human- or AI-generated. Plagiarism checks were also performed on the AI-generated content to evaluate its uniqueness.
RESULTS: A total of 250 human-authored articles and 750 ChatGPT-generated texts were analyzed with the three AI-output detectors (Corrector App, ZeroGPT, GPTZero). Human-authored texts consistently had the lowest AI likelihood scores, while AI-generated texts had significantly higher scores across all versions of ChatGPT (p < 0.01). Plagiarism detection showed high originality for ChatGPT-generated content, with no significant differences among versions (p > 0.05). ROC analysis showed that the AI-output detectors distinguished AI-generated content from human-written texts, with areas under the curve (AUC) ranging from 0.75 to 1.00 for all models. However, none of the detectors achieved 100% reliability in distinguishing AI-generated content.
CONCLUSIONS: While models like ChatGPT enhance content creation and efficiency, they raise ethical concerns, particularly in fields demanding trust and precision. AI-output detectors show moderate to high success in distinguishing AI-generated texts, but false positives pose risks to researchers. Improving detector reliability and establishing clear policies on AI usage are critical to mitigating misuse while fully leveraging AI's benefits.
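As an illustration of the ROC analysis mentioned in the results, the sketch below computes an AUC and a false-positive rate from detector scores. It is a minimal example and not the authors' analysis: the score lists, the 50% threshold, and the function names `roc_auc` and `false_positive_rate` are all hypothetical, chosen only to show how the two quantities relate.

```python
def roc_auc(human_scores, ai_scores):
    """AUC as the probability that a randomly chosen AI-generated text
    receives a higher AI-likelihood score than a randomly chosen
    human-written text (ties count as half)."""
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


def false_positive_rate(human_scores, threshold):
    """Share of human-written texts flagged as AI at a given score threshold."""
    flagged = sum(1 for h in human_scores if h >= threshold)
    return flagged / len(human_scores)


# Hypothetical AI-likelihood scores (percent); not data from the study.
human = [5, 12, 20, 35, 60]
ai = [30, 55, 70, 85, 95]

auc = roc_auc(human, ai)               # 0.88 for these made-up scores
fpr = false_positive_rate(human, 50)   # 0.2: one of five human texts flagged

print(f"AUC = {auc:.2f}, false-positive rate at 50% = {fpr:.2f}")
```

Even with a fairly high AUC, the made-up example still flags one human-written text in five at this threshold, which is the kind of false-positive risk the conclusions describe.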