Language Statistics at Different Spatial, Temporal, and Grammatical Scales.

Author information

Sánchez-Puig Fernanda, Lozano-Aranda Rogelio, Pérez-Méndez Dante, Colman Ewan, Morales-Guzmán Alfredo J, Rivera Torres Pedro Juan, Pineda Carlos, Gershenson Carlos

Affiliations

Facultad de Ciencias, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico.

Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico.

Publication information

Entropy (Basel). 2024 Aug 29;26(9):734. doi: 10.3390/e26090734.

Abstract

In recent decades, the field of statistical linguistics has made significant strides, which have been fueled by the availability of data. Leveraging Twitter data, this paper explores the English and Spanish languages, investigating their rank diversity across different scales: temporal intervals (ranging from 3 to 96 h), spatial radii (spanning 3 km to over 3000 km), and grammatical word ngrams (ranging from 1-grams to 5-grams). The analysis focuses on word ngrams, examining a time period of 1 year (2014) and eight different countries. Our findings highlight the relevance of all three scales with the most substantial changes observed at the grammatical level. Specifically, at the monogram level, rank diversity curves exhibit remarkable similarity across languages, countries, and temporal or spatial scales. However, as the grammatical scale expands, variations in rank diversity become more pronounced and influenced by temporal, spatial, linguistic, and national factors. Additionally, we investigate the statistical characteristics of Twitter-specific tokens, including emojis, hashtags, and user mentions, revealing a sigmoid pattern in their rank diversity function. These insights contribute to quantifying universal language statistics while also identifying potential sources of variation.
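The abstract does not define rank diversity. As a rough illustration only, the sketch below assumes the definition commonly used in the rank diversity literature: for each rank k, d(k) is the number of distinct n-grams that occupy rank k across T time intervals, divided by T. The function names, the whitespace tokenization, and the alphabetical tie-breaking are all hypothetical choices made for this example and are not taken from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ranked(tokens, n):
    """Rank the n-grams of one interval by descending frequency.

    Ties are broken alphabetically so the result is deterministic
    (an assumption of this sketch, not a choice documented in the paper).
    """
    counts = Counter(ngrams(tokens, n))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ordered]


def rank_diversity(intervals, n=1, max_rank=None):
    """Compute d(k) for k = 1..depth (list index 0 holds rank 1).

    intervals: one token list per time interval.
    d(k) = (distinct n-grams seen at rank k over all intervals) / T.
    Only ranks present in every interval are reported.
    """
    rankings = [ranked(tokens, n) for tokens in intervals]
    depth = min(len(r) for r in rankings)
    if max_rank is not None:
        depth = min(depth, max_rank)
    total = len(rankings)
    return [len({r[k] for r in rankings}) / total for k in range(depth)]


if __name__ == "__main__":
    # Toy data standing in for tweets grouped by time interval.
    toy = [
        "the cat sat on the mat the end".split(),
        "the dog sat on the log the end".split(),
        "a cat and a dog and a bird".split(),
    ]
    print(rank_diversity(toy, n=1))
    print(rank_diversity(toy, n=2, max_rank=3))
```

Under this definition, values near 1/T mean the same n-gram holds a rank in nearly every interval, while values near 1 mean the rank is occupied by a different n-gram each time. Changing `n` corresponds to the grammatical scale in the abstract, and regrouping the tokens into longer or shorter intervals corresponds to the temporal scale.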

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/544c/11431497/a917b95b8a2d/entropy-26-00734-g001.jpg
