Stanford University, USA.
Soc Sci Res. 2022 Nov;108:102798. doi: 10.1016/j.ssresearch.2022.102798. Epub 2022 Oct 1.
Since the beginning of this millennium, human-generated text in machine-readable form has become increasingly available to social scientists, offering a distinctive window into social life. However, harnessing vast quantities of this highly unstructured data systematically poses a combination of analytical and methodological challenges. Fortunately, our understanding of how to overcome these challenges has also developed greatly over the same period. In this article, I present a novel typology of the methods social scientists have used to analyze text data at scale in order to test and develop social theory. I describe three "families" of methods: analyses of (1) term frequency, (2) document structure, and (3) semantic similarity. For each family, I discuss its logical and statistical foundations, its analytical strengths and weaknesses, and its prominent variants and applications.
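As a rough illustration of the first family named in the abstract, term frequency, the sketch below counts how often each word appears in a document relative to the document's length. It is not taken from the article: the tokenizer (a simple regular expression), the function name `term_frequencies`, and the two toy documents are all assumptions made for this example.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Return the relative frequency of each lowercase word token in `text`.

    Hypothetical helper for illustration; the tokenization rule is an assumption.
    """
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Divide each raw count by the document length so documents are comparable.
    return {term: n / total for term, n in counts.items()}


# Toy corpus, invented for this example.
docs = {
    "doc_a": "Workers organize; workers strike.",
    "doc_b": "Markets rise and markets fall.",
}

freqs = {name: term_frequencies(text) for name, text in docs.items()}

for name, table in freqs.items():
    # Show the most frequent term in each document.
    top_term = max(table, key=table.get)
    print(name, top_term, round(table[top_term], 2))
```

Normalizing by document length is one common design choice; raw counts or other weightings could be substituted depending on the research question.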