Zhang Kaixiong, Zhang Yongbing, Yu Zhengtao, Huang Yuxin, Tan Kaiwen
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China.
Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China.
Math Biosci Eng. 2024 Jan;21(1):1125-1143. doi: 10.3934/mbe.2024047. Epub 2022 Dec 25.
Cross-lingual summarization (CLS) is the task of condensing lengthy source-language text into a concise summary in a target language. This poses a dual challenge, demanding both cross-language semantic understanding (i.e., semantic alignment) and effective information compression. Traditionally, researchers have tackled these challenges with two types of methods: pipeline methods (e.g., translate-then-summarize) and end-to-end methods. The former is intuitive but prone to error propagation, particularly for low-resource languages. The latter has shown impressive performance, owing to multilingual pre-trained models (mPTMs). However, mPTMs (e.g., mBART) are trained primarily on resource-rich languages, which limits their semantic alignment capabilities for low-resource languages. To address these issues, this paper combines the intuitiveness of pipeline methods with the effectiveness of mPTMs and proposes a two-stage fine-tuning method for low-resource cross-lingual summarization (TFLCLS). In the first stage, recognizing the deficiency of mPTMs in semantic alignment for low-resource languages, a semantic alignment fine-tuning method is employed to enhance the mPTMs' understanding of such languages. In the second stage, considering that mPTMs are not originally tailored for information compression and that CLS demands the model to align and compress simultaneously, an adaptive joint fine-tuning method is introduced; it further strengthens the semantic alignment and information compression abilities of the mPTMs trained in the first stage. To evaluate the performance of TFLCLS, a low-resource CLS dataset, named Vi2ZhLow, is constructed from scratch; moreover, two additional low-resource CLS datasets, En2ZhLow and Zh2EnLow, are synthesized from widely used large-scale CLS datasets.
Experimental results show that TFLCLS outperforms state-of-the-art methods by 18.88%, 12.71% and 16.91% in ROUGE-2 on the three datasets, respectively, even when limited to only 5,000 training samples.
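The abstract does not specify how the stage-2 "adaptive joint" objective balances alignment and summarization, so the following is a hypothetical, framework-free sketch of one plausible reading: stage 1 fine-tunes on translation pairs only, while stage 2 combines both losses with weights adapted to each loss's relative magnitude, so the currently harder objective receives more gradient signal. The function names and the adaptation rule are illustrative assumptions, not the paper's method.

```python
def adaptive_weights(align_loss: float, summ_loss: float) -> tuple[float, float]:
    """One plausible 'adaptive' rule (an assumption, not the paper's):
    weight each objective by its share of the total loss, so the harder
    task dominates the joint gradient. Weights sum to 1."""
    total = align_loss + summ_loss
    return align_loss / total, summ_loss / total


def joint_loss(align_loss: float, summ_loss: float) -> float:
    """Stage-2 joint objective: adaptively weighted sum of the semantic
    alignment loss and the summarization (compression) loss."""
    w_align, w_summ = adaptive_weights(align_loss, summ_loss)
    return w_align * align_loss + w_summ * summ_loss


def two_stage_schedule(align_batches: int, joint_batches: int) -> list[str]:
    """Illustrative training schedule: stage 1 uses only translation pairs
    (alignment); stage 2 optimizes the adaptive joint objective."""
    stage1 = ["align"] * align_batches          # semantic alignment fine-tuning
    stage2 = ["joint"] * joint_batches          # adaptive joint fine-tuning
    return stage1 + stage2
```

Under this rule, equal losses reduce to a plain average (`joint_loss(2.0, 2.0) == 2.0`), while an imbalanced pair such as `(1.0, 3.0)` yields weights `(0.25, 0.75)`, shifting emphasis toward the weaker summarization objective.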