Demir Gizem Boztaş, Süküt Yağızalp, Duran Gökhan Serhat, Topsakal Kübra Gülnur, Görgülü Serkan
Department of Orthodontics, Gulhane Faculty of Dentistry, University of Health Sciences, Ankara, Türkiye.
Eur J Orthod. 2024 Apr 1;46(2). doi: 10.1093/ejo/cjae011.
The rapid advancement of large language models (LLMs) has prompted exploration of their efficacy in generating PICO-based (Patient, Intervention, Comparison, Outcome) queries, especially in the field of orthodontics. This study aimed to assess the usability of LLMs in aiding systematic review processes, with a specific focus on comparing the performance of ChatGPT 3.5 and ChatGPT 4 using a specialized prompt tailored for orthodontics.
MATERIALS/METHODS: Five databases were searched to curate a sample of 77 systematic reviews and meta-analyses published between 2016 and 2021. Using prompt engineering techniques, the LLMs were directed to formulate PICO questions, Boolean queries, and relevant keywords. The outputs were then evaluated for accuracy and consistency by independent researchers using three-point and six-point Likert scales. Additionally, the PICO records of the 41 studies with matching PROSPERO registrations were compared with the responses provided by the models.
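The study does not publish its prompt or query-construction code, but the PICO-to-Boolean step it asked the models to perform can be illustrated with a minimal sketch. The function name, field names, and example terms below are assumptions for illustration only: synonyms within each PICO field are joined with OR, and the four fields are joined with AND, mirroring a conventional structured search string.

```python
# Hedged sketch (not the study's actual code): composing a Boolean
# search query from PICO components. Field names and terms are illustrative.

def build_boolean_query(pico: dict[str, list[str]]) -> str:
    """Join synonyms within each PICO field with OR, then join fields with AND."""
    clauses = []
    for field in ("patient", "intervention", "comparison", "outcome"):
        terms = pico.get(field)
        if terms:  # skip empty fields (e.g. no explicit comparator)
            clauses.append("(" + " OR ".join(f'"{t}"' for t in terms) + ")")
    return " AND ".join(clauses)

query = build_boolean_query({
    "patient": ["orthodontic patients", "malocclusion"],
    "intervention": ["clear aligners"],
    "comparison": ["fixed appliances", "braces"],
    "outcome": ["treatment duration"],
})
print(query)
# → ("orthodontic patients" OR "malocclusion") AND ("clear aligners")
#   AND ("fixed appliances" OR "braces") AND ("treatment duration")
```

In the study, the LLMs were prompted to generate such strings directly; a deterministic composer like this would serve only as a reference against which model outputs are scored.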
RESULTS: Both ChatGPT 3.5 and ChatGPT 4 consistently produced PICO-based queries. Statistically significant differences in accuracy were observed in specific categories, with GPT-4 often outperforming GPT-3.5.
LIMITATIONS: The study's test set might not encapsulate the full range of LLM application scenarios, and the emphasis on specific question types may not reflect the models' complete capabilities.
CONCLUSIONS/IMPLICATIONS: Both ChatGPT 3.5 and 4 can be pivotal tools for generating PICO-driven queries in orthodontics when optimally configured. However, the precision required in medical research necessitates a judicious and critical evaluation of LLM-generated outputs, advocating for a circumspect integration into scientific investigations.