Miao Chen, Zhang Zhenghao, Chen Jiamin, Rebibo Daniel, Wu Haoran, Fung Sin-Hang, Cheng Alfred Sze-Lok, Tsui Stephen Kwok-Wing, Sinha Sanju, Cao Qin, Yip Kevin Y
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
Comput Struct Biotechnol J. 2025 Jul 24;27:3299-3306. doi: 10.1016/j.csbj.2025.07.042. eCollection 2025.
While large language models (LLMs) have shown promising capabilities in biomedical applications, measuring their reliability in knowledge extraction remains a challenge. We developed a benchmark to compare LLMs on 11 literature knowledge extraction tasks that are foundational to automatic knowledgebase development, with or without task-specific examples supplied. We found large variation in performance across the LLMs, depending on the level of technical specialization, the difficulty of the tasks, the scattering of the original information, and the format and terminology standardization requirements. We also found that asking the LLMs to provide the source text behind their answers is useful for overcoming some key challenges, but that specifying this requirement in the prompt is difficult.