
APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.

Author Information

Guo Yue, August Tal, Leroy Gondy, Cohen Trevor, Wang Lucy Lu

Affiliations

University of Illinois Urbana-Champaign.

University of Arizona.

Publication Information

Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:9194-9211. doi: 10.18653/v1/2024.emnlp-main.519.

Abstract

While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work (informativeness, simplification, coherence, and faithfulness) and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to the texts of two PLS datasets to create our testbed. Using APPLS, we assess the performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics.
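The core procedure the abstract describes, applying a targeted perturbation that should degrade one PLS criterion and then checking whether a candidate metric scores the original text above its perturbed copy, can be made concrete with a small sketch. The snippet below is illustrative only: the sentence-deletion perturbation, the unigram-recall metric, and the example texts are hypothetical stand-ins, not the actual APPLS perturbations, datasets, or any of the 14 metrics evaluated in the paper.

```python
# Minimal sketch of perturbation-based sensitivity testing for a summary metric.
# All components here are toy placeholders chosen for illustration.
import random
import re
from typing import Callable, List


def _tokens(text: str) -> set:
    # Lowercased, punctuation-free unigram set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unigram_recall(candidate: str, reference: str) -> float:
    # Toy metric: fraction of reference unigrams covered by the candidate
    # (a crude stand-in for a recall-oriented score such as ROUGE-1 recall).
    cand, ref = _tokens(candidate), _tokens(reference)
    return len(cand & ref) / len(ref) if ref else 0.0


def delete_sentences(text: str, frac: float = 0.3, seed: int = 0) -> str:
    # Toy "informativeness" perturbation: randomly drop a fraction of sentences.
    rng = random.Random(seed)
    sents = [s.strip() for s in text.split(".") if s.strip()]
    kept = [s for s in sents if rng.random() > frac] or sents[:1]
    return ". ".join(kept) + "."


def sensitivity(metric: Callable[[str, str], float],
                perturb: Callable[[str], str],
                summaries: List[str],
                references: List[str]) -> float:
    # Fraction of pairs for which the metric ranks the original summary
    # strictly above its perturbed copy; a sensitive metric should score high.
    wins = sum(metric(s, r) > metric(perturb(s), r)
               for s, r in zip(summaries, references))
    return wins / len(summaries)


if __name__ == "__main__":
    references = [
        "The trial tested a new blood pressure drug in adults. The drug lowered "
        "blood pressure. Side effects were mild. Researchers recommend further study."
    ]
    plain_summaries = [
        "Researchers tested a new medicine for high blood pressure in adults. The "
        "medicine lowered blood pressure for most people. Side effects were mild. "
        "The researchers say more study is needed."
    ]
    score = sensitivity(unigram_recall, delete_sentences, plain_summaries, references)
    print(f"Toy metric's sensitivity to the deletion perturbation: {score:.2f}")
```

Run as a script, the toy metric ranks the original above the perturbed text for the single example pair and would be counted as sensitive to this perturbation. The paper applies this kind of logic at scale, with perturbations tied to each of the four criteria and two PLS datasets, to compare 14 metrics.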

Similar Articles

1
APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.
Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:9194-9211. doi: 10.18653/v1/2024.emnlp-main.519.
3
MeaningBERT: assessing meaning preservation between sentences.
Front Artif Intell. 2023 Sep 22;6:1223924. doi: 10.3389/frai.2023.1223924. eCollection 2023.

References Cited by This Article

4
A survey of automated methods for biomedical text simplification.
J Am Med Inform Assoc. 2022 Oct 7;29(11):1976-1988. doi: 10.1093/jamia/ocac149.
6
Paragraph-level Simplification of Medical Texts.
Proc Conf. 2021 Jun;2021:4972-4984. doi: 10.18653/v1/2021.naacl-main.395.
10
Measuring Text Difficulty Using Parse-Tree Frequency.
J Assoc Inf Sci Technol. 2017 Sep;68(9):2088-2100. doi: 10.1002/asi.23855. Epub 2017 Jun 20.
