Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.

Authors

Pechenick Eitan Adam, Danforth Christopher M, Dodds Peter Sheridan

Affiliations

Department of Mathematics and Statistics, University of Vermont, Burlington, Vermont, United States of America; Center for Complex Systems, University of Vermont, Burlington, Vermont, United States of America; Computational Story Lab, University of Vermont, Burlington, Vermont, United States of America; Vermont Advanced Computing Core, University of Vermont, Burlington, Vermont, United States of America.

Publication information

PLoS One. 2015 Oct 7;10(10):e0137041. doi: 10.1371/journal.pone.0137041. eCollection 2015.

DOI:10.1371/journal.pone.0137041
PMID:26445406
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4596490/

Abstract

It is tempting to treat frequency trends from the Google Books data sets as indicators of the "true" popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We use information theoretic methods to highlight these dynamics by examining and comparing major contributions via a divergence measure of English data sets between decades in the period 1800-2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
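
The abstract describes comparing decades through a divergence measure and inspecting which words contribute most to it, but it does not name the measure. As a rough illustration only, the sketch below assumes the Jensen-Shannon divergence (base 2) and uses made-up toy word counts; the function names `normalize`, `jsd_contributions`, and `jsd` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jsd_contributions(p, q):
    """Per-word contributions to the Jensen-Shannon divergence (in bits).

    p and q map words to probabilities; a missing word has probability 0.
    """
    contributions = {}
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / mw)
        contributions[word] = term
    return contributions


def jsd(p, q):
    """Total Jensen-Shannon divergence: 0 for identical, 1 for disjoint."""
    return sum(jsd_contributions(p, q).values())


# Toy counts (invented for illustration, not real Google Books data).
decade_a = normalize(Counter({"the": 50, "love": 30, "war": 20}))
decade_b = normalize(Counter({"the": 50, "love": 10, "et": 20, "al": 20}))

total = jsd(decade_a, decade_b)
ranked = sorted(jsd_contributions(decade_a, decade_b).items(),
                key=lambda item: item[1], reverse=True)
print(f"JSD = {total:.4f} bits")
for word, value in ranked[:3]:
    print(f"{word}: {value:.4f}")
```

Ranking the per-word terms is what surfaces the "major contributions" between two periods: in the toy data, citation-style tokens that appear in only one decade dominate the divergence, while a word with the same relative frequency in both contributes nothing.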

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/05328d60d487/pone.0137041.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/e8b9844f764b/pone.0137041.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/5e75cbeb9822/pone.0137041.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/fdd681aa14b7/pone.0137041.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/258829b0d819/pone.0137041.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/3fbcb61211a6/pone.0137041.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/2a5b6f85824d/pone.0137041.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/e61e8680aabd/pone.0137041.g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/cc39c6a8cf26/pone.0137041.g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/f35ddc42065d/pone.0137041.g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/3a4fb86625b7/pone.0137041.g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/6bc3221c32bd/pone.0137041.g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/6eeee2af4853/pone.0137041.g013.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/5fc65e0dbeb5/pone.0137041.g014.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/114443453753/pone.0137041.g015.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67c4/4596490/501d7b8024c3/pone.0137041.g016.jpg

Similar articles

1
Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.
PLoS One. 2015 Oct 7;10(10):e0137041. doi: 10.1371/journal.pone.0137041. eCollection 2015.
2
Increases in individualistic words and phrases in American books, 1960-2008.
PLoS One. 2012;7(7):e40181. doi: 10.1371/journal.pone.0040181. Epub 2012 Jul 10.
3
Quantitative analysis of culture using millions of digitized books.
Science. 2011 Jan 14;331(6014):176-82. doi: 10.1126/science.1199644. Epub 2010 Dec 16.
4
Knowledge-Driven Event Extraction in Russian: Corpus-Based Linguistic Resources.
Comput Intell Neurosci. 2016;2016:4183760. doi: 10.1155/2016/4183760. Epub 2016 Jan 5.
5
Economic performance and public concerns about social class in twentieth-century books.
Soc Sci Res. 2016 Sep;59:37-51. doi: 10.1016/j.ssresearch.2016.04.007. Epub 2016 Apr 7.
6
Assessing the usefulness of google books' word frequencies for psycholinguistic research on word processing.
Front Psychol. 2011 Mar 2;2:27. doi: 10.3389/fpsyg.2011.00027. eCollection 2011.
7
The rise and fall of rationality in language.
Proc Natl Acad Sci U S A. 2021 Dec 21;118(51). doi: 10.1073/pnas.2107848118.
8
Not all cultural values are created equal: Cultural change in China reexamined through Google books.
Int J Psychol. 2019 Feb;54(1):144-154. doi: 10.1002/ijop.12436. Epub 2017 Jun 20.
9
Easy-to-read texts for students with intellectual disability: linguistic factors affecting comprehension.
J Appl Res Intellect Disabil. 2014 May;27(3):212-25. doi: 10.1111/jar.12065. Epub 2013 Jul 1.
10
Pandemics, epidemics, viruses, plagues, and disease: Comparative frequency analysis of a cultural pathology reflected in science fiction magazines from 1926 to 2015.
Soc Sci Humanit Open. 2020;2(1):100048. doi: 10.1016/j.ssaho.2020.100048. Epub 2020 Sep 9.

Cited by

1
Towards an estimate of the impact of censorship on biomedical literature.
J Am Med Inform Assoc. 2025 Jul 1;32(7):1199-1205. doi: 10.1093/jamia/ocaf089.
2
Lexical innovations are rarely passed on during one's lifetime: Epidemiological perspectives on estimating the basic reproductive ratio of words.
PLoS One. 2024 Dec 5;19(12):e0312336. doi: 10.1371/journal.pone.0312336. eCollection 2024.
3
The impact of terrorist attacks on cultural values as expressed in books.
PLoS One. 2024 Nov 22;19(11):e0311095. doi: 10.1371/journal.pone.0311095. eCollection 2024.
4
The rising entropy of English in the attention economy.
Commun Psychol. 2024 Aug 1;2(1):70. doi: 10.1038/s44271-024-00117-1.
5
Mechanisms upholding the persistence of stigma across 100 years of historical text.
Sci Rep. 2024 May 14;14(1):11069. doi: 10.1038/s41598-024-61044-z.
6
Anomalous diffusion analysis of semantic evolution in major Indo-European languages.
PLoS One. 2024 Mar 26;19(3):e0298650. doi: 10.1371/journal.pone.0298650. eCollection 2024.
7
How Male and Female Literary Authors Write About Affect Across Cultures and Over Historical Periods.
Affect Sci. 2023 Sep 5;4(4):770-780. doi: 10.1007/s42761-023-00219-9. eCollection 2023 Dec.
8
How cognitive selection affects language change.
Proc Natl Acad Sci U S A. 2024 Jan 2;121(1):e2220898120. doi: 10.1073/pnas.2220898120. Epub 2023 Dec 27.
9
Benford's Law applies to word frequency rank in English, German, French, Spanish, and Italian.
PLoS One. 2023 Sep 14;18(9):e0291337. doi: 10.1371/journal.pone.0291337. eCollection 2023.
10
Expansion and evolution of the R programming language.
R Soc Open Sci. 2023 Apr 12;10(4):221550. doi: 10.1098/rsos.221550. eCollection 2023 Apr.

References

1
Books average previous decade of economic misery.
PLoS One. 2014 Jan 8;9(1):e83147. doi: 10.1371/journal.pone.0083147. eCollection 2014.
2
The changing psychology of culture from 1800 through 2000.
Psychol Sci. 2013 Sep;24(9):1722-31. doi: 10.1177/0956797613479387. Epub 2013 Aug 7.
3
Languages cool as they expand: allometric scaling and the decreasing need for new words.
Sci Rep. 2012;2:943. doi: 10.1038/srep00943. Epub 2012 Dec 10.
4
Increases in individualistic words and phrases in American books, 1960-2008.
PLoS One. 2012;7(7):e40181. doi: 10.1371/journal.pone.0040181. Epub 2012 Jul 10.
5
Statistical laws governing fluctuations in word use from word birth to word death.
Sci Rep. 2012;2:313. doi: 10.1038/srep00313. Epub 2012 Mar 15.
6
Temporal patterns of happiness and information in a global social network: hedonometrics and Twitter.
PLoS One. 2011;6(12):e26752. doi: 10.1371/journal.pone.0026752. Epub 2011 Dec 7.
7
Quantitative analysis of culture using millions of digitized books.
Science. 2011 Jan 14;331(6014):176-82. doi: 10.1126/science.1199644. Epub 2010 Dec 16.
8
Experimental study of inequality and unpredictability in an artificial cultural market.
Science. 2006 Feb 10;311(5762):854-6. doi: 10.1126/science.1121066.