Suppr 超能文献


Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark.

Authors

Chen Xiuying, Wang Tairan, Zhu Qingqing, Guo Taicheng, Gao Shen, Lu Zhiyong, Gao Xin, Zhang Xiangliang

Affiliations

King Abdullah University of Science & Technology.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Publication

ArXiv. 2024 Feb 22:arXiv:2402.14359v1.

PMID: 39371090
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11451640/
Abstract

The summarization capabilities of pretrained and large language models (LLMs) have been widely validated in general areas, but their use in scientific corpora, which involve complex sentences and specialized knowledge, has been less assessed. This paper presents conceptual and experimental analyses of scientific summarization, highlighting the inadequacies of traditional evaluation methods, such as n-gram, embedding comparison, and QA, particularly in providing explanations, grasping scientific concepts, or identifying key content. Subsequently, we introduce the Facet-aware Metric (FM), employing LLMs for advanced semantic matching to evaluate summaries based on different aspects. This facet-aware approach offers a thorough evaluation of abstracts by decomposing the evaluation task into simpler subtasks. Recognizing the absence of an evaluation benchmark in this domain, we curate a Facet-based scientific summarization Dataset (FD) with facet-level annotations. Our findings confirm that FM offers a more logical approach to evaluating scientific summaries. In addition, fine-tuned smaller models can compete with LLMs in scientific contexts, while LLMs have limitations in learning from in-context information in scientific domains. This suggests an area for future enhancement of LLMs.

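The facet-aware evaluation the abstract describes — decomposing summary assessment into simpler per-facet subtasks and aggregating the per-facet judgments — can be sketched roughly as follows. This is a minimal sketch, not the paper's implementation: the facet names are illustrative assumptions, and the LLM semantic-matching step that FM relies on is stubbed out with token-level overlap so the code runs standalone.

```python
# Rough sketch of a facet-aware metric in the spirit of FM.
# ASSUMPTIONS: the facet set and the matching rule are illustrative;
# in the paper an LLM performs the semantic matching, which is stubbed
# here with token-level Jaccard overlap so the sketch runs standalone.

FACETS = ["purpose", "method", "result", "conclusion"]  # assumed facet set


def facet_match(candidate: str, reference: str) -> float:
    """Placeholder semantic match: token-level Jaccard overlap.
    FM would instead prompt an LLM to judge facet-level agreement."""
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def facet_aware_score(candidate: dict, reference: dict) -> dict:
    """Decompose evaluation into per-facet subtasks, then aggregate.
    Each facet of the candidate summary is matched only against the
    corresponding facet of the reference abstract."""
    scores = {f: facet_match(candidate.get(f, ""), reference.get(f, ""))
              for f in FACETS}
    scores["overall"] = sum(scores[f] for f in FACETS) / len(FACETS)
    return scores
```

Because each facet is scored separately, a low overall score can be traced to the specific facet that failed, which is what makes such a metric explainable, whereas an n-gram metric like ROUGE returns only a single opaque number.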

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/799a5f65176e/nihpp-2402.14359v2-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/ba2eacd02086/nihpp-2402.14359v2-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/510ec75bef3c/nihpp-2402.14359v2-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/1279829ce4e9/nihpp-2402.14359v2-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/b11fa4f33d31/nihpp-2402.14359v2-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/9c541ff5daea/nihpp-2402.14359v2-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/c489f18cbcad/nihpp-2402.14359v2-f0007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/3df1e33191dd/nihpp-2402.14359v2-f0008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/b4150420ba4e/nihpp-2402.14359v2-f0009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/b42b479b59c4/nihpp-2402.14359v2-f0010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecf1/12281930/77244daf70ed/nihpp-2402.14359v2-f0011.jpg

Similar articles

1. Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark.
   ArXiv. 2024 Feb 22:arXiv:2402.14359v1.
2. A dataset and benchmark for hospital course summarization with adapted large language models.
   J Am Med Inform Assoc. 2025 Mar 1;32(3):470-479. doi: 10.1093/jamia/ocae312.
3. Short-Term Memory Impairment.
4. Improving Large Language Models' Summarization Accuracy by Adding Highlights to Discharge Notes: Comparative Evaluation.
   JMIR Med Inform. 2025 Jul 24;13:e66476. doi: 10.2196/66476.
5. Large Language Models and Empathy: Systematic Review.
   J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.
6. Autistic Students' Experiences of Employment and Employability Support while Studying at a UK University.
   Autism Adulthood. 2025 Apr 3;7(2):212-222. doi: 10.1089/aut.2024.0112. eCollection 2025 Apr.
7. Aligning Large Language Models for Enhancing Psychiatric Interviews Through Symptom Delineation and Summarization: Pilot Study.
   JMIR Form Res. 2024 Oct 24;8:e58418. doi: 10.2196/58418.
8. Developing healthcare language model embedding spaces.
   Artif Intell Med. 2024 Dec;158:103009. doi: 10.1016/j.artmed.2024.103009. Epub 2024 Oct 31.
9. Can open source large language models be used for tumor documentation in Germany? An evaluation on urological doctors' notes.
   BioData Min. 2025 Jul 24;18(1):48. doi: 10.1186/s13040-025-00463-8.
10. Evaluating and Improving Syndrome Differentiation Thinking Ability in Large Language Models: Method Development Study.
    JMIR Med Inform. 2025 Jun 20;13:e75103. doi: 10.2196/75103.

References cited in this article

1. APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.
   Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:9194-9211. doi: 10.18653/v1/2024.emnlp-main.519.
2. Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations.
   Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:9871-9889. doi: 10.18653/v1/2023.acl-long.549.
3. Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization.
   AMIA Jt Summits Transl Sci Proc. 2021 May 17;2021:605-614. eCollection 2021.
4. Evaluation of PICO as a knowledge representation for clinical questions.
   AMIA Annu Symp Proc. 2006;2006:359-63.
5. The introduction, methods, results, and discussion (IMRAD) structure: a fifty-year survey.
   J Med Libr Assoc. 2004 Jul;92(3):364-7.