Kasakewitch Joao P G, Lima Diego L, Balthazar da Silveira Carlos A, Sanha Valberto, Rasador Ana Caroline, Cavazzola Leandro Totti, Mayol Julio, Malcher Flavio
Department of Surgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA.
Department of Surgery, Montefiore Medical Center, The Bronx, New York, USA.
J Laparoendosc Adv Surg Tech A. 2025 Jun;35(6):437-444. doi: 10.1089/lap.2024.0277. Epub 2025 Apr 26.
This study assesses the reliability of artificial intelligence (AI) large language models (LLMs) in identifying relevant literature comparing inguinal hernia repair techniques. We used LLM chatbots (Bing Chat AI, ChatGPT versions 3.5 and 4.0, and Gemini) to find comparative studies and randomized controlled trials on inguinal hernia repair techniques. The results were then compared with existing systematic reviews (SRs) and meta-analyses and checked for the authenticity of the listed articles. The LLMs retrieved 22 studies from 2006 to 2023 across eight journals, while the SRs encompassed a total of 42 studies. Through thorough external validation, 63.6% of the studies (14 out of 22), comprising 10 identified through ChatGPT 4.0 and 6 via Bing AI (with an overlap of 2 studies between them), were confirmed to be authentic. Conversely, 36.3% (8 out of 22), all produced by Google Gemini (Bard), were revealed to be fabrications, with two (25.0%) of these fabrications mistakenly linked to valid DOIs. Four (28.6%) of the 14 real studies were acknowledged in the SRs, representing 18.1% of all LLM-generated studies. The LLMs missed a total of 38 (90.5%) of the studies included in the previous SRs, while 10 real studies were found by the LLMs but were not included in the previous SRs. Among those 10 studies, 6 were reviews and 1 was published after the SRs, leaving a total of three comparative studies missed by the reviews. This study reveals the mixed reliability of AI language models in scientific searches, emphasizing the need for cautious application of AI in academia and continuous evaluation of AI tools in scientific investigations.
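The abstract does not specify how the authenticity check was performed, but a finding like "fabricated citations linked to valid DOIs" implies verifying each LLM-supplied DOI against a bibliographic registry. A minimal sketch of such a check against the public CrossRef REST API is shown below; the function names are illustrative, and a real workflow would also compare the returned title and authors against the LLM's claimed citation (a DOI can resolve yet belong to a different paper):

```python
import re
import urllib.error
import urllib.parse
import urllib.request

# A DOI starts with a registrant prefix "10.NNNN..." followed by "/" and a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(doi: str) -> bool:
    """Cheap syntactic check before any network call."""
    return bool(DOI_PATTERN.match(doi))


def doi_is_registered(doi: str, timeout: float = 10.0) -> bool:
    """Return True if CrossRef has a record for this DOI.

    CrossRef's works endpoint answers 200 for registered DOIs
    and 404 for unregistered ones.
    """
    if not looks_like_doi(doi):
        return False
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # syntactically valid but never registered
        raise


if __name__ == "__main__":
    # The DOI of this very article, taken from the citation line above.
    print(looks_like_doi("10.1089/lap.2024.0277"))  # True
    print(looks_like_doi("not-a-doi"))              # False
```

As the study's 25.0% rate of fabricated citations carrying valid DOIs suggests, DOI resolution alone is necessary but not sufficient; matching the resolved metadata to the claimed reference is the step that actually exposes fabrication.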