Universidad Autónoma de Madrid, C/Francisco Tomás y Valiente 11, 28049 Madrid, Spain.
BMC Bioinformatics. 2013 Feb 27;14:71. doi: 10.1186/1471-2105-14-71.
The position of a sentence in a document has traditionally been considered an indicator of its relevance, and it is therefore frequently used by automatic summarization systems as a feature for sentence selection. Sentences close to the beginning of the document are assumed to deal with the main topic and are thus selected for the summary. This criterion has been shown to be very effective when summarizing some types of documents, such as news items. However, this property is unlikely to hold for other types of documents, such as scientific articles, where other positional criteria may be preferable. The purpose of the present work is to study the utility of different positional strategies for biomedical literature summarization.
We evaluated three positional strategies: (1) favoring sentences at the beginning of the document, (2) favoring sentences at the beginning and end of the document, and (3) weighting sentences according to the section in which they appear. To this end, we implemented two summarizers, one based on semantic graphs and the other on concept frequencies, combined each of them with each of the positional strategies above, and evaluated the resulting summaries using ROUGE metrics. Our results indicate that summary quality can be improved by weighting sentences according to the section in which they appear (≈17% improvement in ROUGE-2 for the graph-based summarizer and ≈20% for the frequency-based summarizer), and that the sections containing the most salient information are Methods and Material and Discussion and Results.
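To make the three positional strategies concrete, the following is a minimal sketch rather than the implementation evaluated here. The abstract does not state how positional information is combined with the summarizer's own sentence scores, so the linear combination, the mixing parameter `lam`, the `Sentence` type, and the values in `SECTION_WEIGHTS` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sentence:
    text: str
    index: int         # 0-based position of the sentence in the document
    section: str       # heading of the section the sentence belongs to
    base_score: float  # score from the underlying summarizer (graph- or frequency-based)


# Hypothetical section weights, for illustration only (not taken from the paper).
SECTION_WEIGHTS = {
    "Introduction": 0.5,
    "Methods": 1.0,
    "Results": 1.0,
    "Discussion": 1.0,
}
DEFAULT_SECTION_WEIGHT = 0.5  # assumed weight for any section not listed above


def begin_weight(s: Sentence, n: int) -> float:
    """Strategy 1: favor sentences near the beginning of the document."""
    return 1.0 - s.index / n


def begin_end_weight(s: Sentence, n: int) -> float:
    """Strategy 2: favor sentences near the beginning and the end."""
    if n == 1:
        return 1.0
    rel = s.index / (n - 1)    # 0.0 at the first sentence, 1.0 at the last
    return max(rel, 1.0 - rel)  # 1.0 at both ends, 0.5 in the middle


def section_weight(s: Sentence, n: int) -> float:
    """Strategy 3: weight a sentence by the section in which it appears."""
    return SECTION_WEIGHTS.get(s.section, DEFAULT_SECTION_WEIGHT)


def summarize(
    sentences: List[Sentence],
    positional: Callable[[Sentence, int], float],
    k: int,
    lam: float = 0.5,
) -> List[Sentence]:
    """Pick the k best sentences under an assumed linear score combination."""
    n = len(sentences)
    ranked = sorted(
        sentences,
        key=lambda s: (1.0 - lam) * s.base_score + lam * positional(s, n),
        reverse=True,
    )
    # Return the selected sentences in their original document order.
    return sorted(ranked[:k], key=lambda s: s.index)
```

Calling `summarize(doc, section_weight, k)` on a list of `Sentence` objects applies strategy (3); passing `begin_weight` or `begin_end_weight` instead applies strategies (1) and (2) to the same base scores, which keeps the comparison between strategies independent of the underlying summarizer.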
Traditional positional criteria that favor sentences at the beginning and/or end of the document were found not to be helpful when summarizing scientific literature. In contrast, weighting sentences according to the section in which they appear is a more appropriate strategy.