Aurora Cobo Aguilera, Pablo M. Olmos, Antonio Artés-Rodríguez, Fernando Pérez-Cruz
Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Avda. de la Universidad 30, 28911, Leganés, Madrid, Spain.
Swiss Data Science Center (ETHZ/EPFL), Universitätstrasse 25, 8006, Zurich, Switzerland.
Neural Netw. 2023 Apr;161:565-574. doi: 10.1016/j.neunet.2023.01.032. Epub 2023 Feb 9.
Language models (LMs) have grown non-stop in the last decade, from sequence-to-sequence architectures to attention-based Transformers. However, regularization has not been deeply studied in these architectures. In this work, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a regularizer layer. We study how its benefits depend on the depth at which it is placed and demonstrate its effectiveness in several scenarios. Experimental results demonstrate that including deep generative models within Transformer-based architectures such as BERT, RoBERTa, or XLM-R can yield more versatile models, able to generalize better, achieve improved imputation scores in tasks such as SST-2 and TREC, and even impute missing/noisy words with richer text.
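To make the idea concrete, the sketch below shows one plausible wiring of a GMVAE regularizer over the hidden states of a Transformer layer: the activations are encoded into a latent code drawn from a variational posterior, decoded back, and a reconstruction-plus-KL (ELBO-style) penalty is exposed as an auxiliary loss. This is a minimal illustration, not the authors' code; the layer sizes, the number of mixture components, the MSE reconstruction term, and the pass-through output are all assumptions.

```python
# Minimal sketch of a GMVAE regularizer layer over Transformer hidden states.
# All dimensions and the loss form are illustrative assumptions, not values
# taken from the paper.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMVAERegularizer(nn.Module):
    def __init__(self, hidden_dim=768, latent_dim=64, n_components=10):
        super().__init__()
        self.enc_mu = nn.Linear(hidden_dim, latent_dim)
        self.enc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Linear(latent_dim, hidden_dim)
        # Learnable Gaussian-mixture prior p(z) = (1/K) * sum_k N(mu_k, diag(sigma_k^2)).
        self.prior_mu = nn.Parameter(torch.randn(n_components, latent_dim))
        self.prior_logvar = nn.Parameter(torch.zeros(n_components, latent_dim))

    @staticmethod
    def _log_gauss(x, mu, logvar):
        # Diagonal-Gaussian log density, summed over the latent dimension.
        return -0.5 * (logvar + (x - mu) ** 2 / logvar.exp()
                       + math.log(2 * math.pi)).sum(-1)

    def forward(self, h):
        # h: (batch, seq_len, hidden_dim) activations from a Transformer layer.
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        rec = F.mse_loss(self.dec(z), h)  # reconstruction term of the ELBO
        # Single-sample Monte-Carlo estimate of KL(q(z|h) || p(z)).
        log_q = self._log_gauss(z, mu, logvar)
        log_p = torch.logsumexp(
            self._log_gauss(z.unsqueeze(-2), self.prior_mu, self.prior_logvar),
            dim=-1) - math.log(self.prior_mu.shape[0])
        aux_loss = rec + (log_q - log_p).mean()
        return h, aux_loss  # pass activations through; expose the ELBO penalty
```

In such a setup, fine-tuning would add the penalty to the task objective, e.g. `loss = task_loss + beta * aux_loss` with a tunable weight `beta` (a hypothetical hyperparameter name; the paper's exact weighting and insertion depth may differ).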