Martin Gerlach, Francesc Font-Clos
Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA.
Center for Complexity and Biosystems, Department of Physics, University of Milan, 20133 Milano, Italy.
Entropy (Basel). 2020 Jan 20;22(1):126. doi: 10.3390/e22010126.
The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensus full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10⁹ word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, time series of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
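The counts level of the released corpus is a plain-text frequency list per book. The sketch below shows how such a file might be loaded in Python; the tab-separated word/count layout and the PG<id>_counts.txt file naming are assumptions about the public SPGC release and should be checked against the repository documentation rather than taken as the authors' interface.

```python
from collections import Counter
from pathlib import Path


def load_counts(path):
    """Read one SPGC-style counts file into a Counter.

    Assumes each line holds a word and its count separated by a tab,
    e.g. "the<TAB>1523" (format assumed, not verified against the release).
    """
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, _, count = line.rstrip("\n").partition("\t")
            if count:
                counts[word] = int(count)
    return counts


# Hypothetical usage: the directory and file name follow the assumed
# counts/PG<id>_counts.txt pattern.
counts = load_counts(Path("counts") / "PG2701_counts.txt")
print("vocabulary size:", len(counts))
print("total tokens   :", sum(counts.values()))
```

From such per-book counts, corpus-level statistics (e.g. vocabulary growth or word-frequency distributions across time, subjects, and authors) can be aggregated without re-tokenizing the raw text.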