Törö Tuukka, Suni Antti, Šimko Juraj
Department of Digital Humanities, University of Helsinki, Helsinki, Finland.
PLoS One. 2025 Aug 25;20(8):e0330755. doi: 10.1371/journal.pone.0330755. eCollection 2025.
Investigating linguistic relationships on a global scale requires analyzing diverse features such as syntax, phonology, and prosody, which evolve at varying rates under the influence of internal diversification, language contact, and sociolinguistic factors. Recent advances in machine learning (ML) offer complementary alternatives to traditional historical and typological approaches. Instead of relying on expert labor to analyze specific linguistic features, these new methods enable the exploration of linguistic variation through embeddings derived directly from speech, opening new avenues for large-scale, data-driven analyses. This study employs embeddings from the fine-tuned XLS-R self-supervised language identification model voxlingua107-xls-r-300m-wav2vec to analyze relationships between 106 world languages based on speech recordings. Language embeddings are clustered using linear discriminant analysis (LDA) and compared with genealogical, lexical, and geographical distances. The results demonstrate that embedding-based distances align closely with traditional measures, effectively capturing both global and local typological patterns. Challenges in visualizing relationships, particularly with hierarchical clustering and network-based methods, highlight the dynamic nature of language change. The findings show potential for scalable analyses of language variation based on speech embeddings, providing new perspectives on relationships among languages. By addressing methodological considerations such as corpus size and latent space dimensionality, this approach opens avenues for studying low-resource languages and bridging macro- and micro-level linguistic variation. Future work aims to extend these methods to underrepresented languages and integrate sociolinguistic variation for a more comprehensive understanding of linguistic diversity.
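The pipeline summarized in the abstract (per-utterance embeddings, an LDA projection, distances between languages, comparison with a reference distance) can be illustrated with a minimal sketch. This is not the authors' code. It assumes synthetic Gaussian "embeddings" in place of real XLS-R outputs, a plain NumPy implementation of Fisher LDA, Euclidean distances between language centroids, and a Pearson correlation over the upper triangle of the two distance matrices as the comparison. All function names, sizes, and the reference matrix are hypothetical.

```python
import numpy as np

def lda_project(X, y, n_components):
    """Project X onto the leading Fisher LDA directions."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    Sw += 1e-6 * np.eye(d)  # small ridge for numerical stability
    # Solve Sb w = lambda Sw w by whitening with the Cholesky factor of Sw.
    L = np.linalg.cholesky(Sw)
    M = np.linalg.solve(L, np.linalg.solve(L, Sb).T)
    M = (M + M.T) / 2  # enforce exact symmetry before eigh
    evals, V = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1][:n_components]
    W = np.linalg.solve(L.T, V[:, order])
    return X @ W

def centroid_distances(Z, y):
    """Euclidean distances between per-class centroids in Z."""
    classes = np.unique(y)
    cents = np.stack([Z[y == c].mean(axis=0) for c in classes])
    diff = cents[:, None, :] - cents[None, :, :]
    return classes, np.sqrt((diff ** 2).sum(axis=-1))

def upper_triangle_correlation(A, B):
    """Pearson correlation between the off-diagonal upper triangles."""
    iu = np.triu_indices_from(A, k=1)
    return np.corrcoef(A[iu], B[iu])[0, 1]

# Synthetic stand-in data: 5 hypothetical "languages", 16-dim embeddings.
rng = np.random.default_rng(0)
n_langs, dim, n_per_lang = 5, 16, 200
true_means = 3.0 * rng.normal(size=(n_langs, dim))
X = np.concatenate(
    [m + rng.normal(size=(n_per_lang, dim)) for m in true_means]
)
y = np.repeat(np.arange(n_langs), n_per_lang)

Z = lda_project(X, y, n_components=n_langs - 1)
langs, D_emb = centroid_distances(Z, y)

# Hypothetical reference distances (a stand-in for genealogical, lexical,
# or geographical distances): here, distances between the true means.
ref_diff = true_means[:, None, :] - true_means[None, :, :]
D_ref = np.sqrt((ref_diff ** 2).sum(axis=-1))

r = upper_triangle_correlation(D_emb, D_ref)
print(f"correlation between embedding and reference distances: {r:.3f}")
```

With real data, `X` would hold embeddings extracted from speech recordings and `D_ref` an externally sourced distance matrix. A permutation-based test such as a Mantel test would be a more appropriate significance check than a raw correlation, because the entries of a distance matrix are not independent.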