Using natural language processing and machine learning to replace human content coders.

Author information

Wang Yilei, Tian Jingyuan, Yazar Yagizhan, Ones Deniz S, Landers Richard N

Affiliation

Department of Psychology, University of Minnesota at Twin Cities.

Publication information

Psychol Methods. 2024 Dec;29(6):1148-1163. doi: 10.1037/met0000518. Epub 2022 Aug 25.

DOI:10.1037/met0000518

PMID:36006759

Abstract

Content analysis is a common and flexible technique to quantify and make sense of qualitative data in psychological research. However, the practical implementation of content analysis is extremely labor-intensive and subject to human coder errors. Applying natural language processing (NLP) techniques can help address these limitations. We explain and illustrate these techniques to psychological researchers. For this purpose, we first present a study exploring the creation of psychometrically meaningful predictions of human content codes. Using an existing database of human content codes, we build an NLP algorithm to validly predict those codes, at generally acceptable standards. We then conduct a Monte-Carlo simulation to model how four dataset characteristics (i.e., sample size, unlabeled proportion of cases, classification base rate, and human coder reliability) influence content classification performance. The simulation indicated that the influence of sample size and unlabeled proportion on model classification performance tended to be curvilinear. In addition, base rate and human coder reliability had a strong effect on classification performance. Finally, using these results, we offer practical recommendations to psychologists on the necessary dataset characteristics to achieve valid prediction of content codes to guide researchers on the use of NLP models to replace human coders in content analysis research. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
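
As a rough illustration of how two of the four simulated characteristics (classification base rate and human coder reliability) can be manipulated in a Monte Carlo study, the following Python sketch generates binary content codes, corrupts them with random coder error, and reports Cohen's kappa between the true and the observed codes. It is not the authors' simulation design: the function names, the label-flipping error model, and the grid values are assumptions chosen for illustration, and the sketch measures label agreement rather than the performance of an NLP classifier.

```python
import random


def simulate_noisy_codes(n, base_rate, flip_prob, rng):
    """Draw n true binary codes, then flip each with probability flip_prob.

    Assumption: coder unreliability is modeled as symmetric random
    label flipping, which is a simplification.
    """
    true_codes = [1 if rng.random() < base_rate else 0 for _ in range(n)]
    human_codes = [1 - c if rng.random() < flip_prob else c for c in true_codes]
    return true_codes, human_codes


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) codes."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:  # degenerate case: both lists constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)


def run_grid(n=2000, reps=50, seed=0):
    """Average kappa over reps replications per (base_rate, flip_prob) cell.

    The grid values below are hypothetical, not taken from the article.
    """
    rng = random.Random(seed)
    results = {}
    for base_rate in (0.1, 0.5):
        for flip_prob in (0.0, 0.1, 0.3):
            kappas = []
            for _ in range(reps):
                true_codes, human_codes = simulate_noisy_codes(
                    n, base_rate, flip_prob, rng
                )
                kappas.append(cohen_kappa(true_codes, human_codes))
            results[(base_rate, flip_prob)] = sum(kappas) / reps
    return results


if __name__ == "__main__":
    for (base_rate, flip_prob), kappa in sorted(run_grid().items()):
        print(f"base_rate={base_rate:.1f} flip_prob={flip_prob:.1f} "
              f"mean kappa={kappa:.3f}")
```

Under this error model, mean kappa falls as the flip probability rises, and for a given nonzero flip probability it is lower at the skewed base rate than at the balanced one. A fuller simulation along the lines described in the abstract would also vary sample size and the unlabeled proportion of cases, and would train and evaluate a classifier in each cell.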
