An astronomical question answering dataset for evaluating large language models.

Authors

Li Jie, Zhao Fuyong, Chen Panfeng, Xie Jiafu, Zhang Xiangrui, Li Hui, Chen Mei, Wang Yanhao, Zhu Ming

Affiliations

State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.

School of Data Science and Engineering, East China Normal University, Shanghai, 200062, China.

Publication

Sci Data. 2025 Mar 18;12(1):447. doi: 10.1038/s41597-025-04613-9.

Abstract

Large language models (LLMs) have recently demonstrated exceptional capabilities across a variety of linguistic tasks including question answering (QA). However, it remains challenging to assess their performance in astronomical QA due to the lack of comprehensive benchmark datasets. To bridge this gap, we construct Astro-QA, the first benchmark dataset specifically for QA in astronomy. The dataset contains a collection of 3,082 questions of six types in both English and Chinese, along with standard (reference) answers and related material. These questions encompass several core branches of astronomy, including astrophysics, astrometry, celestial mechanics, history of astronomy, and astronomical techniques and methods. Furthermore, we propose a new measure called DGscore that integrates different measures for objective and subjective questions and incorporates a weighting scheme based on type- and question-specific difficulty coefficients to accurately assess the QA performance of each LLM. We validate the Astro-QA dataset through extensive experimentation with 27 open-source and commercial LLMs. The results show that it can serve as a reliable benchmark dataset to evaluate the capacity of LLMs in terms of instruction following, knowledge reasoning, and natural language generation in the astronomical domain, which can calibrate current progress and facilitate future research on astronomical LLMs.
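
The abstract does not give the exact formulation of DGscore, but its description (separate measures for objective and subjective questions, combined under type- and question-specific difficulty weights) suggests a difficulty-weighted aggregation. The sketch below is an illustrative assumption of such a scheme, not the paper's actual formula; the class names, the difficulty range, and the per-question graders are all hypothetical.

```python
# Minimal sketch of a difficulty-weighted QA score in the spirit of DGscore.
# The exact formula, difficulty coefficients, and per-question graders are not
# specified in this abstract; everything below is an illustrative assumption.

from dataclasses import dataclass

@dataclass
class QAItem:
    is_objective: bool   # objective (e.g. multiple choice) vs. subjective (free-text)
    difficulty: float    # hypothetical per-question difficulty coefficient, e.g. in [1, 3]
    score: float         # per-question score in [0, 1] from the appropriate grader

def aggregate_score(items: list[QAItem]) -> float:
    """Difficulty-weighted average over all graded questions.

    Objective questions are assumed to be scored by exact/choice matching and
    subjective ones by a text-similarity or model-based grader, both normalized
    to [0, 1] before the weighting is applied.
    """
    total_weight = sum(q.difficulty for q in items)
    if total_weight == 0.0:
        return 0.0
    return sum(q.difficulty * q.score for q in items) / total_weight

# Example: two objective questions and one subjective question of varying difficulty.
items = [
    QAItem(is_objective=True,  difficulty=1.0, score=1.0),  # easy, answered correctly
    QAItem(is_objective=True,  difficulty=2.0, score=0.0),  # harder, answered wrongly
    QAItem(is_objective=False, difficulty=3.0, score=0.7),  # subjective, partial credit
]
print(f"weighted score: {aggregate_score(items):.3f}")  # 0.517
```

Under this kind of weighting, harder questions contribute proportionally more to the final score, so a model that only answers easy questions correctly scores lower than one that also handles difficult items.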

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dbbc/11920588/dfa226ff6689/41597_2025_4613_Fig1_HTML.jpg
