Salkar Nikita, Trikalinos Thomas, Wallace Byron C, Nenkova Ani
Khoury College of Computer Sciences, Northeastern University, USA.
Health Services, Policy and Practice, Brown University, USA.
Proc Conf Assoc Comput Linguist Meet. 2022 Nov;2022:341-350.
We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5, and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for different inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language, is associated with a higher rate of self-repetition. In qualitative analysis we find that systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus-level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
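The core measurement described in the abstract — n-grams of length four or longer that recur across multiple outputs of the same system — can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code; the function names are invented, whitespace tokenization stands in for whatever tokenizer the paper used, and the upper cap on n-gram length (`max_n`) is an assumption for tractability rather than something the abstract specifies.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def self_repeated_ngrams(summaries, min_n=4, max_n=6):
    """N-grams (length min_n..max_n) that occur in more than one
    output summary of the same system.

    min_n=4 follows the abstract ("length four or longer");
    max_n is an assumed cap, since unbounded n is impractical.
    """
    counts = Counter()
    for summary in summaries:
        tokens = summary.split()  # assumed: simple whitespace tokenization
        seen = set()
        for n in range(min_n, max_n + 1):
            seen.update(ngrams(tokens, n))
        counts.update(seen)  # count each n-gram at most once per summary
    return {g for g, c in counts.items() if c > 1}

# Toy system outputs: the first two share a formulaic phrase,
# the kind of artefact the qualitative analysis reports.
outputs = [
    "the study was funded by the national institute of health",
    "results were mixed the study was funded by the trust",
    "no repetition occurs anywhere in this particular summary",
]
repeated = self_repeated_ngrams(outputs)
```

Counting each n-gram once per summary (via the `seen` set) means the metric captures cross-output repetition, not repetition within a single summary, which matches the abstract's framing of self-repetition as content shared across a system's outputs.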