Dipartimento di Fisica e Astronomia "Galileo Galilei", Istituto Nazionale di Fisica Nucleare, Università degli Studi di Padova, Padova, Italy.
Dipartimento di Matematica "Tullio Levi-Civita", Università degli Studi di Padova, Padova, Italy.
PLoS One. 2021 Jul 1;16(7):e0253461. doi: 10.1371/journal.pone.0253461. eCollection 2021.
Big data require new techniques to handle the information they carry. Here we consider four datasets (email communication, Twitter posts, Wikipedia articles and Gutenberg books) and propose a novel statistical framework to predict global statistics from random samples. More precisely, from a small sample of sent emails per sender, posts per hashtag and word occurrences, we infer the number of senders, hashtags and words in the whole dataset and how their abundances (e.g. the popularity of a hashtag) change across scales. Our approach is grounded in statistical ecology: we map the inference of human activities onto the unseen species problem in biodiversity. Our findings may have applications to resource management in email systems, collective attention monitoring on Twitter and language learning processes in word databases.
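To give a concrete feel for the unseen species problem mentioned in the abstract, the sketch below applies the bias-corrected Chao1 estimator, a classical tool from statistical ecology, to a toy sample of hashtags. This is not the framework proposed in the paper, only a minimal illustration of estimating the total number of distinct types from a small sample. The function name `chao1_estimate` and the toy data are hypothetical.

```python
from collections import Counter


def chao1_estimate(sample):
    """Bias-corrected Chao1 lower bound on the number of distinct types.

    Illustrative stand-in only; NOT the estimator proposed in the paper.
    """
    abundances = Counter(sample)                  # type -> count in the sample
    freq_of_freq = Counter(abundances.values())   # count -> number of types with that count
    s_obs = len(abundances)                       # distinct types actually observed
    f1 = freq_of_freq.get(1, 0)                   # types seen exactly once (singletons)
    f2 = freq_of_freq.get(2, 0)                   # types seen exactly twice (doubletons)
    # Many singletons relative to doubletons suggest many types are still unseen.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical toy sample: each entry is the hashtag of one sampled post.
sample = ["#a", "#a", "#a", "#b", "#b", "#c", "#d", "#e"]
print(chao1_estimate(sample))
```

Here five hashtags are observed, three of them only once, so the estimate exceeds the observed count. Estimators of this kind only bound the number of types from below and say nothing about how abundances change with scale, which is the additional question the abstract addresses.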