Zhang Kaixiong, Zhang Yongbing, Yu Zhengtao, Huang Yuxin, Tan Kaiwen
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China.
Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China.
Math Biosci Eng. 2024 Jan;21(1):1125-1143. doi: 10.3934/mbe.2024047. Epub 2022 Dec 25.
Cross-lingual summarization (CLS) is the task of condensing lengthy source-language text into a concise summary in a target language. This poses a dual challenge, demanding both cross-language semantic understanding (i.e., semantic alignment) and effective information compression. Traditionally, researchers have tackled these challenges with two types of methods: pipeline methods (e.g., translate-then-summarize) and end-to-end methods. The former is intuitive but prone to error propagation, particularly for low-resource languages. The latter has shown impressive performance, owing to multilingual pre-trained models (mPTMs). However, mPTMs (e.g., mBART) are trained primarily on resource-rich languages, which limits their semantic alignment capabilities for low-resource languages. To address these issues, this paper combines the intuitiveness of pipeline methods with the effectiveness of mPTMs and proposes a two-stage fine-tuning method for low-resource cross-lingual summarization (TFLCLS). In the first stage, recognizing the deficiency of mPTMs in semantic alignment for low-resource languages, a semantic alignment fine-tuning method is employed to enhance the mPTMs' understanding of such languages. In the second stage, considering that mPTMs are not originally tailored for information compression and that CLS demands the model to align and compress simultaneously, an adaptive joint fine-tuning method is introduced; it further strengthens the semantic alignment and information compression abilities of the mPTMs trained in the first stage. To evaluate the performance of TFLCLS, a low-resource CLS dataset, named Vi2ZhLow, is constructed from scratch; moreover, two additional low-resource CLS datasets, En2ZhLow and Zh2EnLow, are synthesized from widely used large-scale CLS datasets.
Experimental results show that TFLCLS outperforms state-of-the-art methods by 18.88%, 12.71% and 16.91% in ROUGE-2 on the three datasets, respectively, even when limited to only 5,000 training samples.
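The abstract does not specify how the stage-2 "adaptive joint" objective balances alignment and summarization, so the following is a hypothetical, framework-free sketch of one plausible reading: stage 1 fine-tunes on translation pairs only, while stage 2 combines both losses with weights adapted to each loss's relative magnitude, so the currently harder objective receives more gradient signal. The function names and the adaptation rule are illustrative assumptions, not the paper's method.

```python
def adaptive_weights(align_loss: float, summ_loss: float) -> tuple[float, float]:
    """One plausible 'adaptive' rule (an assumption, not the paper's):
    weight each objective by its share of the total loss, so the harder
    task dominates the joint gradient. Weights sum to 1."""
    total = align_loss + summ_loss
    return align_loss / total, summ_loss / total


def joint_loss(align_loss: float, summ_loss: float) -> float:
    """Stage-2 joint objective: adaptively weighted sum of the semantic
    alignment loss and the summarization (compression) loss."""
    w_align, w_summ = adaptive_weights(align_loss, summ_loss)
    return w_align * align_loss + w_summ * summ_loss


def two_stage_schedule(align_batches: int, joint_batches: int) -> list[str]:
    """Illustrative training schedule: stage 1 uses only translation pairs
    (alignment); stage 2 optimizes the adaptive joint objective."""
    stage1 = ["align"] * align_batches          # semantic alignment fine-tuning
    stage2 = ["joint"] * joint_batches          # adaptive joint fine-tuning
    return stage1 + stage2
```

Under this rule, equal losses reduce to a plain average (`joint_loss(2.0, 2.0) == 2.0`), while an imbalanced pair such as `(1.0, 3.0)` yields weights `(0.25, 0.75)`, shifting emphasis toward the weaker summarization objective.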