Leibniz Institute for the German Language (IDS), Mannheim, Germany.
Sci Rep. 2023 Oct 28;13(1):18521. doi: 10.1038/s41598-023-45373-z.
Computational language models (LMs), most notably exemplified by the widespread success of OpenAI's ChatGPT chatbot, show impressive performance on a wide range of linguistic tasks, thus providing cognitive science and linguistics with a computational working model to empirically study different aspects of human language. Here, we use LMs to test the hypothesis that languages with more speakers tend to be easier to learn. In two experiments, we train several LMs, ranging from very simple n-gram models to state-of-the-art deep neural networks, on written cross-linguistic corpus data covering 1293 different languages and statistically estimate learning difficulty. Using a variety of quantitative methods and machine learning techniques to account for phylogenetic relatedness and geographical proximity of languages, we show that there is robust evidence for a relationship between learning difficulty and speaker population size. However, contrary to expectations derived from previous research, our results suggest that languages with more speakers tend to be harder to learn.
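To illustrate the general approach, the following is a minimal Python sketch of one plausible operationalization, not the authors' actual pipeline: a character-level bigram model with add-one smoothing whose held-out cross-entropy (bits per character) serves as a crude proxy for learning difficulty. All function names and the toy corpus are illustrative; the study itself trained a range of models, from n-gram models to deep neural networks, on large cross-linguistic corpora.

from collections import Counter
import math

def train_bigram(text):
    """Count character bigrams and unigram contexts in the training text."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    vocab = set(text)
    return bigrams, contexts, vocab

def cross_entropy(text, bigrams, contexts, vocab):
    """Average negative log2 probability per character under the
    add-one-smoothed bigram model (lower = easier to predict)."""
    v = len(vocab)
    total, n = 0.0, 0
    for prev, cur in zip(text, text[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        total -= math.log2(p)
        n += 1
    return total / n

# Hypothetical usage: train on one slice of a corpus, score another.
train_text = "the quick brown fox jumps over the lazy dog " * 50
test_text = "the lazy dog naps under the quick brown fox "
model = train_bigram(train_text)
print(f"{cross_entropy(test_text, *model):.3f} bits/char")

Under this kind of proxy, comparing per-character cross-entropies across languages (after controlling for phylogeny and geography, as the paper does) is what makes the difficulty-population relationship statistically testable.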