Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.
Methods. 2022 Jul;203:160-166. doi: 10.1016/j.ymeth.2022.03.012. Epub 2022 Apr 2.
Abstractive summarization models can generate summaries auto-regressively, but summary quality is often degraded by noise in the source text. Learning cross-sentence relations is a crucial step in this task, and graph-based networks are more effective at capturing relationships between sentences. Moreover, domain knowledge is important for distinguishing noise in text from specialized domains. This paper proposes a novel model structure called UGDAS, which combines a sentence-level denoiser based on an unsupervised graph network with an auto-regressive generator. It uses domain knowledge and sentence position information to denoise the original text and thereby improve the quality of the generated summaries. For the text summarization task, we use the recently introduced CORD-19 (COVID-19 Open Research Dataset), which contains large-scale data on coronaviruses. The experimental results show that our model achieves state-of-the-art (SOTA) results on the CORD-19 dataset and outperforms the related baseline models on the PubMed Abstract dataset.
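To make the idea of an unsupervised, graph-based sentence-level denoiser more concrete, the following is a minimal sketch and not the UGDAS implementation, whose details are not given in the abstract. It assumes a TextRank-style sentence graph with Jaccard word overlap as edge weights, a PageRank-like centrality score, and a simple exponential position prior that favors earlier sentences. The function names (`tokenize`, `similarity`, `centrality`, `denoise`) and the parameters `keep` and `position_decay` are hypothetical, and domain knowledge is not modeled here.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def centrality(weights, damping=0.85, iterations=50):
    """PageRank-style power iteration over a weighted sentence graph."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Score flowing into sentence i from every connected sentence j.
            incoming = sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def denoise(sentences, keep=3, position_decay=0.1):
    """Keep the `keep` most central sentences, preserving document order.

    Assumption: sentences that are weakly connected to the rest of the
    document are treated as noise, and earlier sentences get a mild boost.
    """
    if not sentences:
        return []
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Fully connected graph without self-loops; edge weight = word overlap.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = centrality(weights)
    # Combine graph centrality with the (assumed) position prior.
    ranked = sorted(
        range(n),
        key=lambda i: scores[i] * math.exp(-position_decay * i),
        reverse=True,
    )
    kept = sorted(ranked[:keep])
    return [sentences[i] for i in kept]
```

In a pipeline of the kind the abstract describes, the output of `denoise` would be passed to an auto-regressive generator in place of the full source text. The choice of word overlap and a fixed decay is purely illustrative; a learned graph network would replace both.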