Chacoma A, Zanette D H
Instituto de Física Enrique Gaviola, Consejo Nacional de Investigaciones Científicas y Técnicas and Universidad Nacional de Córdoba, Ciudad Universitaria, 5000 Córdoba, Pcia. de Córdoba, Argentina.
Centro Atómico Bariloche and Instituto Balseiro, Comisión Nacional de Energía Atómica and Universidad Nacional de Cuyo, Consejo Nacional de Investigaciones Científicas y Técnicas, Av. Bustillo 9500, 8400 San Carlos de Bariloche, Pcia. de Río Negro, Argentina.
R Soc Open Sci. 2020 Mar 18;7(3):200008. doi: 10.1098/rsos.200008. eCollection 2020 Mar.
We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or 'tags'), and analyse the progressive appearance of new words of each tag along each individual text. We find that, as prescribed by Heaps' Law, vocabulary sizes and text lengths follow a well-defined power-law relation. Meanwhile, the appearance of new words within each text does not obey a power law, and is on the whole well described by the average over random shufflings of the text. Deviations from this average, however, are statistically significant and show systematic trends across the corpus. Specifically, we find that the appearance of new words along each text is predominantly retarded with respect to the average over random shufflings. Moreover, different tags make systematically distinct contributions to this tendency: two of the tags are respectively more and less retarded than the mean trend, while the third follows the overall mean. These statistical regularities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' Law, a feature that is still in need of extensive assessment.
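The two quantities the abstract contrasts can be illustrated with a minimal sketch. It is not the authors' code. The function names (`heaps_function`, `shuffled_average`, `fit_heaps_exponent`) are hypothetical, tokens are assumed to be already extracted (and, for a per-tag analysis, already filtered by tag), and the power law V = k·N^β is fitted by ordinary least squares in log-log space, which is an assumption about the fitting procedure rather than something the abstract states.

```python
import math
import random


def heaps_function(tokens):
    """Return V(n) for n = 1..N: distinct tokens among the first n tokens."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def shuffled_average(tokens, n_shuffles=100, seed=0):
    """Average V(n) over random shufflings of the same tokens (null model)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    work = list(tokens)
    total = [0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, v in enumerate(heaps_function(work)):
            total[i] += v
    return [t / n_shuffles for t in total]


def fit_heaps_exponent(lengths, vocab_sizes):
    """Fit V = k * N**beta across texts by least squares on (log N, log V).

    Assumption: a plain log-log regression; the paper may fit differently.
    """
    xs = [math.log(n) for n in lengths]
    ys = [math.log(v) for v in vocab_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return beta, k


if __name__ == "__main__":
    text = "the cat saw the dog and the dog saw the cat".split()
    real = heaps_function(text)
    null = shuffled_average(text, n_shuffles=200)
    # Negative deviation: new words appear later than in the shuffled average.
    deviation = [r - m for r, m in zip(real, null)]
    print(deviation)
```

Under these assumptions, `fit_heaps_exponent` would be applied once across the corpus (one point per text: total length and total vocabulary), while `heaps_function` and `shuffled_average` are applied within a single text. A predominantly negative difference between the two curves corresponds to what the abstract calls a retarded appearance of new words.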