Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size.

Author Information

Koplenig Alexander, Wolfer Sascha, Müller-Spitzer Carolin

Affiliation

Department of Lexical Studies, Institute for the German Language (IDS), 68161 Mannheim, Germany.

Publication Information

Entropy (Basel). 2019 May 3;21(5):464. doi: 10.3390/e21050464.

Abstract

Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf's law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of a German weekly news magazine (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
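
The generalized entropies of order α mentioned above form a one-parameter family that interpolates between emphasizing the many rare word types (small α) and the few very frequent ones (large α). Purely as an illustrative sketch, not the authors' implementation, the Python snippet below computes one common variant of this family (the Havrda-Charvát/Tsallis form) together with a Jensen-Shannon-type divergence built from it; the exact definitions, normalization, and estimators used in the paper may differ.

```python
# Illustrative sketch only (not the authors' code): a Havrda-Charvat/Tsallis-type
# generalized entropy of order alpha and a Jensen-Shannon-type divergence for
# comparing the word frequency distributions of two texts.
import numpy as np
from collections import Counter


def generalized_entropy(p, alpha):
    """Generalized entropy of order alpha of a probability vector p.

    alpha = 1 recovers the Shannon entropy; alpha < 1 gives more weight to the
    many rare word types, alpha > 1 to the few very frequent ones.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # zero-probability types contribute nothing
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float((np.sum(p ** alpha) - 1.0) / (1.0 - alpha))


def js_divergence(p, q, alpha):
    """Jensen-Shannon-type divergence of order alpha between two distributions
    defined over the same (zero-padded) vocabulary."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                   # mixture distribution
    return generalized_entropy(m, alpha) - 0.5 * (
        generalized_entropy(p, alpha) + generalized_entropy(q, alpha)
    )


def to_distribution(tokens, vocabulary):
    """Relative word frequencies of a token list over a fixed shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens)
    return np.array([counts[w] / total for w in vocabulary])
```

For two tokenized texts, one would build a shared vocabulary, map each text to a distribution with to_distribution, and evaluate js_divergence at several values of alpha: larger alpha stresses the few high-frequency function words, smaller alpha the many low-frequency content words, which is the magnification effect described in the abstract.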
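
The paper's central point, that such measures depend heavily on sample size, can be previewed with a hypothetical toy simulation: draw increasingly long samples from one fixed Zipfian vocabulary and recompute the plug-in entropy estimate. The snippet below, which reuses generalized_entropy from the sketch above, only illustrates the kind of text-length dependence under discussion; it does not reproduce the corpus, measures, or results of the study.

```python
# Hypothetical toy illustration of sample-size dependence: the "language" is a
# fixed Zipfian distribution, yet the plug-in entropy estimate changes with the
# number of sampled word tokens (the text length).
import numpy as np

rng = np.random.default_rng(0)

V = 100_000                                   # vocabulary size (word types)
ranks = np.arange(1, V + 1)
zipf_p = (1.0 / ranks) / np.sum(1.0 / ranks)  # Zipf's law: p(rank r) ~ 1/r

for n in (1_000, 10_000, 100_000, 1_000_000):      # increasing text lengths
    tokens = rng.choice(V, size=n, p=zipf_p)        # sample n word tokens
    counts = np.bincount(tokens, minlength=V)
    p_hat = counts / n                              # estimated word frequencies
    h_hat = generalized_entropy(p_hat, alpha=1.0)   # plug-in Shannon estimate
    print(f"n={n:>9,}  observed types={np.count_nonzero(counts):>7,}  H_hat={h_hat:.3f}")
```

Because a Zipfian frequency spectrum always contains word types too rare to appear in a finite sample, such plug-in estimates keep shifting as n grows, which is the kind of sample-size dependence the abstract warns about when texts of different lengths are compared.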

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b6f/7514953/79153d132220/entropy-21-00464-g001.jpg
