
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size.

Authors

Koplenig Alexander, Wolfer Sascha, Müller-Spitzer Carolin

Affiliation

Department of Lexical Studies, Institute for the German language (IDS), 68161 Mannheim, Germany.

Publication information

Entropy (Basel). 2019 May 3;21(5):464. doi: 10.3390/e21050464.

DOI: 10.3390/e21050464
PMID: 33267178
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7514953/

Abstract

Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf's law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
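
The sample-size dependence described in the abstract can be illustrated with a small sketch. The abstract does not spell out the formula, so the code below assumes the Tsallis-type form H_α(p) = (Σ p_i^α − 1) / (1 − α), which reduces to Shannon entropy as α → 1, and a Jensen-Shannon-style divergence built from it. The function names, the synthetic Zipf-like vocabulary, and all parameter values are hypothetical choices for illustration; this is not the authors' code or corpus.

```python
import math
import random
from collections import Counter


def generalized_entropy(probs, alpha):
    """Entropy of order alpha (assumed Tsallis-type form, in nats).

    For alpha == 1 the Shannon entropy is returned as the limiting case.
    """
    probs = [p for p in probs if p > 0]
    if abs(alpha - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** alpha for p in probs) - 1.0) / (1.0 - alpha)


def generalized_jsd(counts_a, counts_b, alpha):
    """Jensen-Shannon-style divergence of order alpha between two count tables."""
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    vocab = set(counts_a) | set(counts_b)
    p = [counts_a.get(w, 0) / n_a for w in vocab]
    q = [counts_b.get(w, 0) / n_b for w in vocab]
    m = [(x + y) / 2 for x, y in zip(p, q)]  # equal-weight mixture
    return (generalized_entropy(m, alpha)
            - 0.5 * generalized_entropy(p, alpha)
            - 0.5 * generalized_entropy(q, alpha))


def zipf_sample(n_tokens, vocab_size, rng):
    """Draw n_tokens word IDs with probability proportional to 1/rank (synthetic data)."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    return Counter(rng.choices(range(vocab_size), weights=weights, k=n_tokens))


# Two samples from the SAME source: any divergence is a finite-sample effect.
rng = random.Random(0)
for n_tokens in (200, 2000, 20000):
    a = zipf_sample(n_tokens, 2000, rng)
    b = zipf_sample(n_tokens, 2000, rng)
    print(n_tokens, round(generalized_jsd(a, b, alpha=1.0), 4))
```

Both samples in each comparison come from the same distribution, so a divergence above zero reflects text length alone rather than any difference in the underlying "language". This toy setup only shows the plug-in estimate shrinking as the samples grow; it does not reproduce the paper's analysis of sampling-based corrections.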

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b6f/7514953/79153d132220/entropy-21-00464-g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b6f/7514953/30833b5f96b5/entropy-21-00464-g002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b6f/7514953/bfc114250d8b/entropy-21-00464-g003.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b6f/7514953/dd5d5a8b6ca6/entropy-21-00464-g004.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b6f/7514953/26ec802d2ce6/entropy-21-00464-g005.jpg
Figure 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b6f/7514953/60add4039b11/entropy-21-00464-g006.jpg
