Department of Psychology, University of Konstanz, Konstanz, Germany.
PLoS One. 2019 Mar 22;14(3):e0213554. doi: 10.1371/journal.pone.0213554. eCollection 2019.
The Google Books Ngram Viewer (Google Ngram) is a search engine that charts word frequencies from a large corpus of books and thereby allows for the examination of cultural change as it is reflected in books. While the tool's massive corpus of data (about 8 million books or 6% of all books ever published) has been used in various scientific studies, concerns about the accuracy of results have simultaneously emerged. This paper reviews the literature and serves as a guideline for improving Google Ngram studies by suggesting five methodological procedures suited to increase the reliability of results. In particular, we recommend the use of (I) different language corpora, (II) cross-checks on different corpora from the same language, (III) word inflections, (IV) synonyms, and (V) a standardization procedure that accounts for both the influx of data and unequal weights of word frequencies. Further, we outline how to combine these procedures and address the risk of potential biases arising from censorship and propaganda. As an example of the proposed procedures, we examine the cross-cultural expression of religion via religious terms for the years 1900 to 2000. Special emphasis is placed on the situation during World War II. In line with the strand of literature that emphasizes the decline of collectivistic values, our results suggest an overall decrease in religion's importance. However, religion regains importance during times of crisis such as World War II. By comparing the results obtained through the different methods, we illustrate that applying, and particularly combining, our suggested procedures increases the reliability of results and prevents authors from drawing incorrect conclusions.
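Procedure (V) can be illustrated with a minimal Python sketch. The abstract does not specify the computation, so everything here is an assumption made for illustration: the influx of data is offset by dividing raw yearly counts by a yearly corpus total, and unequal weights are offset by converting each word's series to z-scores before averaging across words (for example, a term together with its inflections and synonyms). The function names are hypothetical.

```python
from statistics import mean, pstdev


def relative_frequency(counts, totals):
    """Divide yearly raw counts by yearly corpus totals.

    Assumption: this is one way to offset the growing volume of
    scanned text per year (the "influx of data").
    """
    return [c / t for c, t in zip(counts, totals)]


def z_scores(series):
    """Standardize a series to mean 0 and (population) SD 1.

    A constant series has SD 0; it is mapped to all zeros so that
    it does not cause a division by zero.
    """
    m = mean(series)
    sd = pstdev(series)
    if sd == 0:
        return [0.0 for _ in series]
    return [(x - m) / sd for x in series]


def combined_index(word_counts, totals):
    """Average the z-scored relative frequencies of several words.

    Assumption: z-scoring first gives frequent and rare words equal
    weight in the combined yearly index.
    """
    standardized = [
        z_scores(relative_frequency(counts, totals))
        for counts in word_counts.values()
    ]
    # zip(*...) regroups the per-word series into per-year tuples.
    return [mean(year_values) for year_values in zip(*standardized)]


if __name__ == "__main__":
    # Toy numbers invented for illustration only, not Google Ngram data.
    years = [1900, 1950, 2000]
    totals = [1000, 2000, 4000]
    word_counts = {"church": [10, 16, 24], "prayer": [1, 2, 2]}
    for year, value in zip(years, combined_index(word_counts, totals)):
        print(year, round(value, 3))
```

In this toy data the raw counts of "church" rise over time while its share of the corpus falls, which is the kind of distortion that normalizing by corpus size is meant to expose. Without the z-score step, the far more frequent word would dominate a simple average.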