
Similar Articles

1. APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.
   Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:9194-9211. doi: 10.18653/v1/2024.emnlp-main.519.
2. Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
   Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
3. MeaningBERT: assessing meaning preservation between sentences.
   Front Artif Intell. 2023 Sep 22;6:1223924. doi: 10.3389/frai.2023.1223924. eCollection 2023.
4. A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization.
   Proc Mach Learn Res. 2023 Aug;219:2-30.
5. Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations.
   Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:9871-9889. doi: 10.18653/v1/2023.acl-long.549.
6. Assessing the Capability of Large Language Model Chatbots in Generating Plain Language Summaries.
   Cureus. 2025 Mar 21;17(3):e80976. doi: 10.7759/cureus.80976. eCollection 2025 Mar.
7. What Author Instructions Do Health Journals Provide for Writing Plain Language Summaries? A Scoping Review.
   Patient. 2023 Jan;16(1):31-42. doi: 10.1007/s40271-022-00606-7. Epub 2022 Oct 27.
8. Quantifying the informativeness for biomedical literature summarization: An itemset mining method.
   Comput Methods Programs Biomed. 2017 Jul;146:77-89. doi: 10.1016/j.cmpb.2017.05.011. Epub 2017 May 27.
9. Ascle-A Python Natural Language Processing Toolkit for Medical Text Generation: Development and Evaluation Study.
   J Med Internet Res. 2024 Oct 3;26:e60601. doi: 10.2196/60601.
10. Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses.
   medRxiv. 2024 Apr 9:2024.03.20.24304620. doi: 10.1101/2024.03.20.24304620.

Cited By

1. Explainable AI for Clinical Outcome Prediction: A Survey of Clinician Perceptions and Preferences.
   AMIA Jt Summits Transl Sci Proc. 2025 Jun 10;2025:215-224. eCollection 2025.
2. Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark.
   ArXiv. 2024 Feb 22:arXiv:2402.14359v1.

References

1. Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations.
   Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:9871-9889. doi: 10.18653/v1/2023.acl-long.549.
2. Retrieval augmentation of large language models for lay language generation.
   J Biomed Inform. 2024 Jan;149:104580. doi: 10.1016/j.jbi.2023.104580. Epub 2023 Dec 30.
3. A dataset for plain language adaptation of biomedical abstracts.
   Sci Data. 2023 Jan 4;10(1):8. doi: 10.1038/s41597-022-01920-3.
4. A survey of automated methods for biomedical text simplification.
   J Am Med Inform Assoc. 2022 Oct 7;29(11):1976-1988. doi: 10.1093/jamia/ocac149.
5. Plain language summaries: A systematic review of theory, guidelines and empirical research.
   PLoS One. 2022 Jun 6;17(6):e0268789. doi: 10.1371/journal.pone.0268789. eCollection 2022.
6. Paragraph-level Simplification of Medical Texts.
   Proc Conf. 2021 Jun;2021:4972-4984. doi: 10.18653/v1/2021.naacl-main.395.
7. Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization.
   AMIA Jt Summits Transl Sci Proc. 2021 May 17;2021:605-614. eCollection 2021.
8. Next-generation metrics for monitoring genetic erosion within populations of conservation concern.
   Evol Appl. 2017 Nov 22;11(7):1066-1083. doi: 10.1111/eva.12564. eCollection 2018 Aug.
9. The Role of Surface, Semantic and Grammatical Features on Simplification of Spanish Medical Texts: A User Study.
   AMIA Annu Symp Proc. 2018 Apr 16;2017:1322-1331. eCollection 2017.
10. Measuring Text Difficulty Using Parse-Tree Frequency.
   J Assoc Inf Sci Technol. 2017 Sep;68(9):2088-2100. doi: 10.1002/asi.23855. Epub 2017 Jun 20.


APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.

Author Information

Guo Yue, August Tal, Leroy Gondy, Cohen Trevor, Wang Lucy Lu

Affiliations

University of Illinois Urbana-Champaign.

University of Arizona.

Publication Information

Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:9194-9211. doi: 10.18653/v1/2024.emnlp-main.519.

DOI: 10.18653/v1/2024.emnlp-main.519
PMID: 40144005
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11938995/
Abstract

While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work (informativeness, simplification, coherence, and faithfulness) and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to the texts of two PLS datasets to create our testbed. Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics.
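The testbed's core idea is easy to sketch: apply a perturbation that degrades exactly one criterion to a gold plain-language summary, then check whether a candidate metric penalizes the perturbed text. The Python sketch below is a minimal illustration; the perturbation rules, the mini jargon lexicon, and the unigram-F1 stand-in metric are all simplified assumptions for demonstration, not the paper's actual implementation (which also covers faithfulness perturbations and evaluates 14 real metrics).

```python
# Minimal sketch of perturbation-based meta-evaluation in the spirit of APPLS.
# The perturbations and the toy "metric" are illustrative assumptions only.
import random


def split_sents(text):
    """Naive sentence splitter; assumes well-formed '. '-separated prose."""
    return [s for s in text.rstrip(".").split(". ") if s]


def join_sents(sents):
    return ". ".join(sents) + "."


def delete_sentence(text, rng):
    """Informativeness perturbation: drop one sentence of content."""
    sents = split_sents(text)
    if len(sents) < 2:
        return text
    sents.pop(rng.randrange(len(sents)))
    return join_sents(sents)


def inject_jargon(text, rng):
    """Simplification perturbation: replace a lay term with a technical
    synonym from a tiny hypothetical lexicon (rng unused; uniform interface)."""
    lexicon = {
        "heart attack": "myocardial infarction",
        "high blood pressure": "hypertension",
    }
    for lay, jargon in lexicon.items():
        if lay in text:
            return text.replace(lay, jargon, 1)
    return text


def shuffle_sentences(text, rng):
    """Coherence perturbation: permute sentence order."""
    sents = split_sents(text)
    rng.shuffle(sents)
    return join_sents(sents)


def toy_metric(candidate, reference):
    """Stand-in metric: unigram-overlap F1 against the reference. A real run
    would plug in ROUGE, BERTScore, readability scores, or LLM judgments."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    reference = (
        "Aspirin lowered the risk of heart attack in the trial. "
        "Side effects were rare. "
        "People with high blood pressure saw similar benefits."
    )
    perturbations = {
        "informativeness": delete_sentence,
        "simplification": inject_jargon,
        "coherence": shuffle_sentences,
    }
    base = toy_metric(reference, reference)  # score of the unperturbed text
    for criterion, perturb in perturbations.items():
        score = toy_metric(perturb(reference, rng), reference)
        verdict = "detected" if score < base else "missed"
        print(f"{criterion:16s} {score:.3f} vs base {base:.3f} -> {verdict}")
```

Run as-is, the bag-of-words stand-in catches the informativeness and simplification perturbations but is blind to the coherence one, which is exactly the kind of per-criterion blind spot the testbed is designed to expose.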
