Turner Matthew D, Appaji Abhishek, Rakib Nibras Ar, Golnari Pedram, Rajasekar Arcot K, Rathnam K V Anitha, Sahoo Satya S, Wang Yue, Wang Lei, Turner Jessica A
Department of Psychiatry, The Ohio State University, Columbus, Ohio, USA.
Department of Medical Electronics Engineering, B.M.S. College of Engineering, Bengaluru, India.
bioRxiv. 2025 May 14:2025.05.13.653828. doi: 10.1101/2025.05.13.653828.
We show that recent (mid-to-late 2024) commercial large language models (LLMs) can perform good-quality metadata extraction and annotation with very little investigator effort, across several exemplar real-world annotation tasks from the neuroimaging literature. We investigated OpenAI's GPT-4o model, which performed comparably to several groups of specially trained and supervised human annotators. The LLM achieved human-level accuracy, between 0.91 and 0.97, on zero-shot prompts with no feedback to the model. Reviewing disagreements between the LLM and gold-standard human annotations, we found that in most cases the LLM's actual errors were comparable to human errors, and in many cases the disagreements were not errors at all. For the specific annotation types we tested, against carefully reviewed gold-standard correct values, the LLM's performance is usable for metadata annotation at scale. We encourage other research groups to develop and release more specialized "micro-benchmarks," like the ones we provide here, for testing the annotation performance of both LLMs and more complex agent systems on real-world metadata annotation tasks.