Large language models can extract metadata for annotation of human neuroimaging publications.

Authors

Turner Matthew D, Appaji Abhishek, Ar Rakib Nibras, Golnari Pedram, Rajasekar Arcot K, K V Anitha Rathnam, Sahoo Satya S, Wang Yue, Wang Lei, Turner Jessica A

Affiliations

Department of Psychiatry, The Ohio State University, Columbus, OH, United States.

Department of Medical Electronics Engineering, B.M.S. College of Engineering, Bengaluru, India.

Publication Information

Front Neuroinform. 2025 Aug 20;19:1609077. doi: 10.3389/fninf.2025.1609077. eCollection 2025.

Abstract

We show that recent (mid-to-late 2024) commercial large language models (LLMs) are capable of good-quality metadata extraction and annotation, with very little work on the part of investigators, for several exemplar real-world annotation tasks in the neuroimaging literature. We investigated the GPT-4o LLM from OpenAI, which performed comparably to several groups of specially trained and supervised human annotators. The LLM achieves performance similar to that of humans, scoring between 0.91 and 0.97, using zero-shot prompts with no feedback to the LLM. Reviewing the disagreements between the LLM and gold-standard human annotations, we note that actual LLM errors are comparable to human errors in most cases, and in many cases these disagreements are not errors at all. Based on the specific types of annotations we tested, with specially reviewed gold-standard correct values, the LLM performance is usable for metadata annotation at scale. We encourage other research groups to develop and make available more specialized "micro-benchmarks," like the ones we provide here, for testing the annotation performance of both LLMs and more complex agent systems in real-world metadata annotation tasks.
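The abstract does not include the authors' code; the following is a minimal, hypothetical sketch of the workflow it describes: zero-shot annotation of a publication with GPT-4o via the OpenAI Python SDK, followed by an agreement check against gold-standard human labels. The label set, prompt wording, the `annotate` and `agreement` helpers, and the simple fraction-agreement score are illustrative assumptions, not the paper's actual pipeline or reported metric.

```python
# Sketch of zero-shot metadata annotation with GPT-4o and a simple
# agreement check against gold-standard human labels. Illustrative only;
# the prompt, label set, and metric are assumptions, not the paper's pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical annotation task: label the imaging modality used in a paper.
LABELS = ["fMRI", "structural MRI", "EEG", "PET", "other"]


def annotate(abstract_text: str) -> str:
    """Ask GPT-4o, zero-shot (no examples, no feedback), for a single label."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # reduce variability for annotation runs
        messages=[
            {
                "role": "system",
                "content": (
                    "You annotate human neuroimaging publications. "
                    f"Reply with exactly one label from: {', '.join(LABELS)}."
                ),
            },
            {"role": "user", "content": abstract_text},
        ],
    )
    return response.choices[0].message.content.strip()


def agreement(llm_labels: list[str], gold_labels: list[str]) -> float:
    """Fraction of items where the LLM label matches the gold-standard label."""
    matches = sum(l == g for l, g in zip(llm_labels, gold_labels))
    return matches / len(gold_labels)


# Usage, given a list of abstracts and matching gold human annotations:
#   llm = [annotate(a) for a in abstracts]
#   print(f"agreement = {agreement(llm, gold):.2f}")
```

The design mirrors the setup described in the abstract: a single zero-shot prompt per publication with no feedback loop, scored against human gold-standard annotations.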
