Gutierrez-Vasques Ximena, Mijangos Victor
Language and Space Lab, URPP Language and Space, University of Zurich, 8006 Zurich, Switzerland.
Institute of Philological Research, National Autonomous University of Mexico, 04510 Mexico City, Mexico.
Entropy (Basel). 2019 Dec 30;22(1):48. doi: 10.3390/e22010048.
We propose a quantitative approach for measuring the morphological complexity of a language from text. Several corpus-based methods have focused on measuring the variety of word forms that a language can produce. We take into account not only the productivity of morphological processes but also their predictability. We use a language model that predicts the probability of sub-word sequences within a word; we calculate the entropy rate of this model and use it as a measure of the predictability of the internal structure of words. Our results show that it is important to integrate these two dimensions when measuring morphological complexity, since a language can be complex under one measure but simple under the other. We calculated the complexity measures on two different parallel corpora for a typologically diverse set of languages. Our approach is corpus-based and does not require linguistically annotated data.
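The two dimensions described above can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the abstract: the type-token ratio stands in for the productivity measure, and an add-one-smoothed character bigram model stands in for the sub-word language model. The function names `type_token_ratio` and `char_bigram_entropy_rate` are hypothetical, and the entropy is estimated on the same tokens used to fit the model, so this is only a rough in-sample approximation of an entropy rate.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Distinct word forms divided by total word tokens (assumed productivity proxy)."""
    return len(set(tokens)) / len(tokens)


def char_bigram_entropy_rate(tokens):
    """Average bits per symbol under an add-one-smoothed character bigram model.

    Stand-in for the sub-word language model; fitted and evaluated on the
    same non-empty token list, so the value is an in-sample estimate.
    """
    bow, eow = "<", ">"  # word-boundary markers, assumed absent from the text
    bigrams = Counter()
    contexts = Counter()
    vocab = {eow}
    for word in tokens:
        symbols = [bow] + list(word) + [eow]
        vocab.update(symbols[1:])  # symbols that can be predicted
        for prev, cur in zip(symbols, symbols[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1

    vocab_size = len(vocab)
    total_bits = 0.0
    total_symbols = 0
    for (prev, cur), count in bigrams.items():
        prob = (count + 1) / (contexts[prev] + vocab_size)  # add-one smoothing
        total_bits -= count * math.log2(prob)
        total_symbols += count
    return total_bits / total_symbols


if __name__ == "__main__":
    sample = "the cat sat on the mat".split()
    print("TTR:", type_token_ratio(sample))
    print("entropy rate (bits/symbol):", char_bigram_entropy_rate(sample))
```

A higher type-token ratio suggests more distinct word forms, while a lower entropy rate suggests more predictable word-internal structure. Reporting both values per language makes it visible when a language is complex on one dimension but not on the other.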