Probing language identity encoded in pre-trained multilingual models: a typological view.

Authors

Zheng Jianyu, Liu Ying

Affiliation

Department of Chinese Language and Literature, Tsinghua University, Beijing, China.

Publication

PeerJ Comput Sci. 2022 Mar 15;8:e899. doi: 10.7717/peerj-cs.899. eCollection 2022.

Abstract

Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores a linguistic property that models preserve at encoding time: "language identity". We investigated the capability of state-of-the-art pre-trained multilingual models (mBERT, XLM, XLM-R) to preserve language identity through the lens of language typology. We explored model differences and variations in terms of languages, typological features, and internal hidden layers. We found that the ranking of the whole model, and of each of its hidden layers, in preserving language identity is: mBERT > XLM-R > XLM. Furthermore, all three models capture morphological, lexical, word-order, and syntactic features well, but perform poorly on nominal and verbal features. Finally, our results show that the ability of XLM-R and XLM remains stable across layers, whereas the ability of mBERT fluctuates severely. Our findings summarize the ability of each pre-trained multilingual model and its hidden layers to store language identity and typological features, providing insights for later research on cross-lingual information processing.
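The layer-wise probing methodology the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: real experiments would extract sentence representations from each hidden layer of mBERT/XLM/XLM-R and train a classifier to predict the language label; here, synthetic per-layer vectors (with varying cluster separation standing in for how strongly a layer encodes language identity) and a simple nearest-centroid probe are assumptions for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: probe each "hidden layer" for language identity.
# One Gaussian cluster of sentence vectors per language; the separation
# parameter mimics how strongly a given layer separates languages.
n_langs, n_per_lang, dim = 4, 50, 16

def synthetic_layer(separation):
    """Fake layer representations: one cluster per language label."""
    centers = rng.normal(scale=separation, size=(n_langs, dim))
    X = np.vstack([c + rng.normal(size=(n_per_lang, dim)) for c in centers])
    y = np.repeat(np.arange(n_langs), n_per_lang)
    return X, y

def probe_accuracy(X, y):
    """Nearest-centroid probe: how recoverable is the language label?"""
    centroids = np.vstack([X[y == k].mean(axis=0) for k in range(n_langs)])
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    pred = np.argmin(dists, axis=1)
    return (pred == y).mean()

# Layers differing in how cleanly they separate languages: higher probe
# accuracy means the layer preserves language identity better.
for layer, sep in enumerate([0.5, 2.0, 4.0]):
    X, y = synthetic_layer(sep)
    print(f"layer {layer}: language-identity probe accuracy = {probe_accuracy(X, y):.2f}")
```

Comparing probe accuracy across layers (and across models) in this fashion is what yields rankings such as mBERT > XLM-R > XLM and the observation that mBERT's per-layer ability fluctuates while XLM's and XLM-R's stay stable.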


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d1a8/9044357/c58fb21119c1/peerj-cs-08-899-g003.jpg
