Torre Iván G, Luque Bartolo, Lacasa Lucas, Kello Christopher T, Hernández-Fernández Antoni
Departamento de Matemática Aplicada, ETSIAE, Universidad Politécnica de Madrid, Plaza Cardenal Cisneros, 28040 Madrid, Spain.
Cognitive and Information Sciences, University of California Merced, 5200 North Lake Road Merced, 95343 CA, USA.
R Soc Open Sci. 2019 Aug 21;6(8):191023. doi: 10.1098/rsos.191023. eCollection 2019 Aug.
Physical manifestations of linguistic units include sources of variability due to factors of speech production which are by definition excluded from counts of linguistic symbols. In this work, we examine whether linguistic laws hold with respect to the physical manifestations of linguistic units in spoken English. The data we analyse come from a phonetically transcribed database of acoustic recordings of spontaneous speech known as the Buckeye Speech corpus. First, we verify with unprecedented accuracy that acoustically transcribed durations of linguistic units at several scales comply with a lognormal distribution, and we quantitatively justify this 'lognormality law' using a stochastic generative model. Second, we explore the four classical linguistic laws (Zipf's Law, Herdan's Law, Brevity Law and Menzerath-Altmann's Law (MAL)) in oral communication, both in physical units and in symbolic units measured in the speech transcriptions, and find that these laws typically hold more strongly in physical units than in their symbolic counterparts. Additional results include (i) coining a version of Herdan's Law in physical units, (ii) a precise mathematical formulation of Brevity Law, which we show to be connected to optimal compression principles in information theory and which allows us to formulate and validate yet another law, which we call the size-rank law, and (iii) a mathematical derivation of MAL which also highlights an additional regime where the law is inverted. Altogether, these results support the hypothesis that statistical laws in language have a physical origin.
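The 'lognormality law' mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis and does not use the Buckeye data: the durations are synthetic, and the parameter values, the sample size and the helper names `fit_lognormal` and `log_skewness` are assumptions made for illustration. The idea is only that if durations are lognormal, their logarithms are Gaussian, so the mean and standard deviation of the log-durations estimate the two parameters and the skewness of the log-durations should be close to zero.

```python
import numpy as np

def fit_lognormal(durations):
    """Maximum-likelihood (mu, sigma) of a lognormal: mean and std of the log-durations."""
    logs = np.log(np.asarray(durations, dtype=float))
    return logs.mean(), logs.std()

def log_skewness(durations):
    """Sample skewness of the log-durations; near 0 if the data are lognormal."""
    logs = np.log(np.asarray(durations, dtype=float))
    centered = logs - logs.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic "durations" in seconds (hypothetical parameters, not taken from the paper).
rng = np.random.default_rng(0)
true_mu, true_sigma = -2.5, 0.5
synthetic = rng.lognormal(mean=true_mu, sigma=true_sigma, size=50_000)

mu_hat, sigma_hat = fit_lognormal(synthetic)
skew = log_skewness(synthetic)
print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}, log-skewness={skew:.3f}")
```

With real measurements, a skewness of the log-durations far from zero would be a first hint that a lognormal description is inadequate; a formal goodness-of-fit test would still be needed.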