Zheng Jianyu, Liu Ying
Department of Chinese Language and Literature, Tsinghua University, Beijing, China.
PeerJ Comput Sci. 2022 Mar 15;8:e899. doi: 10.7717/peerj-cs.899. eCollection 2022.
Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores a linguistic property that the models preserve at encoding time: "language identity". We investigated the capability of state-of-the-art pre-trained multilingual models (mBERT, XLM, XLM-R) to preserve language identity through the lens of language typology. We explored differences and variations across models in terms of languages, typological features, and internal hidden layers. We found that, both for whole models and for each of their hidden layers, the order of ability to preserve language identity is mBERT > XLM-R > XLM. Furthermore, all three models capture morphological, lexical, word-order, and syntactic features well, but perform poorly on nominal and verbal features. Finally, our results show that the ability of XLM-R and XLM remains stable across layers, whereas the ability of mBERT fluctuates severely. Our findings summarize the ability of each pre-trained multilingual model and its hidden layers to store language identity and typological features, and provide insights for later research on cross-lingual information processing.
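The abstract does not describe the probing setup in code. As a hedged illustration of how layer-wise language-identity probing is commonly performed (not necessarily the authors' actual procedure), the sketch below extracts per-layer mBERT representations with the HuggingFace transformers library and trains a simple logistic-regression probe to predict the language label; the sentences, labels, and probe choice are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the authors' code): probe each mBERT
# hidden layer for language identity with a logistic-regression classifier.
# Requires the `transformers` and `scikit-learn` packages; the toy sentences
# and language labels below are placeholders for a real multilingual corpus.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

sentences = [
    "The cat sleeps.", "The dog barks.",
    "Le chat dort.", "Le chien aboie.",
    "Die Katze schläft.", "Der Hund bellt.",
    "猫在睡觉。", "狗在叫。",
]
labels = ["en", "en", "fr", "fr", "de", "de", "zh", "zh"]  # language-identity labels

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained(
    "bert-base-multilingual-cased", output_hidden_states=True
)
model.eval()

with torch.no_grad():
    enc = tok(sentences, padding=True, return_tensors="pt")
    hidden_states = model(**enc).hidden_states  # embedding layer + 12 transformer layers

mask = enc["attention_mask"].unsqueeze(-1).float()
for layer, h in enumerate(hidden_states):
    # Mean-pool token vectors (ignoring padding) to get one vector per sentence.
    pooled = (h * mask).sum(dim=1) / mask.sum(dim=1)
    X = pooled.numpy()
    # On a real dataset, cross-validated probe accuracy per layer indicates how
    # much language identity that layer preserves.
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, X, labels, cv=2).mean()
    print(f"layer {layer:2d}: probe accuracy = {acc:.2f}")
```

The same loop can be repeated for XLM and XLM-R checkpoints to compare models layer by layer, and the labels can be replaced with typological feature values (e.g., from WALS) to probe individual features instead of language identity.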