Nagaraj Akarsh, Kejriwal Mayank
University of Southern California, Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292, United States.
Data Brief. 2022 Feb 2;41:107905. doi: 10.1016/j.dib.2022.107905. eCollection 2022 Apr.
Recent discourse has highlighted significant gender disparity in many aspects of economic, social, and cultural life. With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational tools to analyze large corpora, such as the copyright-expired literature of the pre-modern period (defined herein as books published approximately between 1800 and 1950) in the Project Gutenberg corpus. Nevertheless, using such tools poses challenges, especially in maintaining sufficiently high quality to explore interesting hypotheses. We present a dataset and materials that illustrate how modern NLP methods can be applied to the raw text of more than 3,000 literary texts in Project Gutenberg to (i) extract characters and pronouns with high quality, (ii) disambiguate characters so that they are not overcounted, and (iii) detect the gender of each character. We also manually labeled the genders of the authors of these texts and published the labels as part of the dataset to facilitate future digital humanities research.
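To make the three steps concrete, the following is a minimal sketch in Python. It is not the pipeline used to build the dataset. It is a toy stand-in that uses only the standard library and assumes that characters are introduced with an honorific, that mentions sharing a surname refer to the same character, and that nearby pronouns hint at gender. The lookup tables, the function name `character_genders`, and the `window` parameter are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical lookup tables for this sketch; they are not part of the dataset.
TITLE_GENDER = {"Mr": "male", "Sir": "male", "Mrs": "female",
                "Miss": "female", "Lady": "female", "Dr": None}
PRONOUN_GENDER = {"he": "male", "him": "male", "his": "male",
                  "she": "female", "her": "female", "hers": "female"}

# Step (i): a mention is an honorific followed by one or two capitalized words.
MENTION_RE = re.compile(
    r"\b(Mr|Mrs|Miss|Sir|Lady|Dr)\.?\s+((?:[A-Z][a-z]+\s+)?[A-Z][a-z]+)"
)


def character_genders(text, window=60):
    """Return {surname: 'male' | 'female' | 'unknown'} for honorific mentions."""
    votes = defaultdict(Counter)
    for match in MENTION_RE.finditer(text):
        title, name = match.group(1), match.group(2)
        # Step (ii): use the surname as the canonical key so that
        # "Mr. Darcy" and "Mr. Fitzwilliam Darcy" count as one character.
        key = name.split()[-1]
        votes[key]  # register the character even if no gender evidence is found
        # Step (iii): a gendered honorific counts as strong evidence ...
        title_gender = TITLE_GENDER[title]
        if title_gender:
            votes[key][title_gender] += 2
        # ... and each pronoun within `window` characters after the mention
        # counts as weak evidence.
        context = text[match.end():match.end() + window].lower()
        for token in re.findall(r"[a-z]+", context):
            pronoun_gender = PRONOUN_GENDER.get(token)
            if pronoun_gender:
                votes[key][pronoun_gender] += 1
    return {key: (counts.most_common(1)[0][0] if counts else "unknown")
            for key, counts in votes.items()}


sample = ("Mr. Darcy bowed, and he left. Later Mr. Fitzwilliam Darcy wrote to "
          "his sister. Dr. Grey said she would call. Miss Bingley smiled.")
print(character_genders(sample))
```

Keying on the surname alone would wrongly merge, for example, a Mr. and a Mrs. of the same family, which is one reason real pipelines typically rely on named-entity recognition and coreference resolution rather than surface patterns like these.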