Räsänen Okko, Kocharov Daniil
Signal Processing Research Centre, Tampere University, Tampere, Finland.
Behav Res Methods. 2025 Sep 4;57(10):275. doi: 10.3758/s13428-025-02772-6.
Computational models of early language development involve implementing theories of learning as functional learning algorithms, exposing these models to realistic language input, and comparing learning outcomes to those in infants. While recent research has made major strides in developing more powerful learning models and evaluation protocols grounded in infant data, models are still predominantly trained with non-naturalistic input data, such as crowd-sourced read speech or text transcripts. This is due to the lack of child-directed speech (CDS) corpora of suitable scale and quality. In parallel, the question of how properties of, and individual variability in, language input affect learning outcomes is an active area of empirical research, underlining the need for realistic yet controllable data for modeling such phenomena. This paper presents a solution to the training data problem through stochastic generation of naturalistic CDS data using statistical models, thereby enabling controlled computational simulations with naturalistic input. We provide a proof-of-concept demonstration of the approach by showing how naturalistic CDS transcripts can be generated with a language model conditioned on recipient information (here, infant age), and how text-to-speech systems can be used to convert the transcripts to high-quality speech with a controllable speaking style. We also conduct modeling experiments with generated speech corpora by varying different aspects of the data, showing how these manipulations map onto different learning outcomes, thereby demonstrating the feasibility of the approach for controlled language learning simulations. Finally, we discuss the limitations of using synthetic data in general, and of the present proof-of-concept pipeline in particular.