Hakam Hassan Tarek, Prill Robert, Korte Lisa, Lovreković Bruno, Ostojić Marko, Ramadanov Nikolai, Muehlensiepen Felix
Center of Orthopaedics and Trauma Surgery, University Clinic of Brandenburg, Brandenburg Medical School, Brandenburg an der Havel, Germany.
Faculty of Health Sciences, University Clinic of Brandenburg, Brandenburg an der Havel, Germany.
JMIR Form Res. 2024 Feb 16;8:e52164. doi: 10.2196/52164.
As large language models (LLMs) become increasingly integrated into different aspects of health care, questions about their implications for the medical academic literature have begun to emerge. With artificial intelligence (AI) now generating linguistically accurate and grammatically sound text, key aspects of academic writing, such as authenticity, are at stake.
The objective of this study is to compare human-written with AI-generated scientific literature in orthopedics and sports medicine.
Five original abstracts were selected from the PubMed database and rewritten with the assistance of 2 LLMs of differing proficiency. Researchers with varying levels of expertise and different areas of specialization were then asked to rank the abstracts according to linguistic and methodological parameters. Finally, the researchers were asked to classify each article as AI generated or human written.
Neither the researchers nor the AI-detection software could reliably identify the AI-generated texts. Furthermore, the criteria previously suggested in the literature correlated neither with whether the researchers deemed a text to be AI generated nor with whether they classified it correctly.
The primary finding of this study was that researchers were unable to distinguish between LLM-generated and human-written texts. However, owing to the small sample size, the results cannot be generalized. As with any tool used in academic research, the potential for harm can be mitigated through the transparency and integrity of the researchers who use it. With scientific integrity at stake, further research with a similar study design should be conducted to determine the magnitude of this issue.