Washington Peter, Kalantarian Haik, Kent Jack, Husic Arman, Kline Aaron, Leblanc Emilie, Hou Cathy, Mutlu Cezmi, Dunlap Kaitlyn, Penev Yordan, Stockham Nate, Chrisman Brianna, Paskov Kelley, Jung Jae-Yoon, Voss Catalin, Haber Nick, Wall Dennis P
Department of Bioengineering, Stanford University.
Department of Pediatrics (Systems Medicine), Stanford University.
Cognit Comput. 2021 Sep;13(5):1363-1373. doi: 10.1007/s12559-021-09936-4. Epub 2021 Sep 27.
BACKGROUND/INTRODUCTION: Emotion detection classifiers traditionally predict discrete emotions. However, emotion expressions are often subjective, so a method is needed to handle compound and ambiguous labels. We explore the feasibility of using crowdsourcing to acquire reliable soft-target labels and evaluate an emotion detection classifier trained with these labels. We hypothesize that training with labels that represent the diversity of human interpretation of an image will yield predictions that are similarly representative on a disjoint test set. We also hypothesize that crowdsourcing can generate label distributions that mirror those generated in a lab setting.
METHODS: We center our study on the Child Affective Facial Expression (CAFE) dataset, a gold-standard collection of images depicting pediatric facial expressions, each accompanied by 100 human labels. To test whether crowdsourcing can generate such labels, we used Microworkers to acquire labels for 207 CAFE images. We evaluate both unfiltered workers and workers selected through a short crowd filtration process. We then train two versions of a ResNet-152 neural network on CAFE, using the original 100 annotations provided with the dataset: (1) a classifier trained with traditional one-hot encoded labels, and (2) a classifier trained with soft-target vector labels representing the distribution of CAFE annotator responses. We compare the softmax output distributions of the two classifiers using a two-sample independent t-test on the L1 distances between each classifier's output probability distribution and the distribution of human labels.
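The two label encodings and the L1 comparison described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the emotion ordering in `EMOTIONS`, the helper names, and the example votes are all assumptions made for the sketch.

```python
import math
from collections import Counter

# Hypothetical label order; the abstract names the emotions but not an ordering.
EMOTIONS = ["happy", "neutral", "sad", "fear", "surprise", "anger", "disgust"]

def soft_label(votes):
    """Turn a list of annotator votes into a probability vector over EMOTIONS."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[e] / total for e in EMOTIONS]

def one_hot_label(votes):
    """Encode the majority-vote emotion as a one-hot vector."""
    soft = soft_label(votes)
    winner = soft.index(max(soft))
    return [1.0 if i == winner else 0.0 for i in range(len(EMOTIONS))]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a (soft or one-hot) target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def l1_distance(p, q):
    """Sum of absolute differences between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Illustrative (made-up) votes for one ambiguous image with 100 annotations.
votes = ["fear"] * 60 + ["surprise"] * 35 + ["happy"] * 5
soft = soft_label(votes)
hard = one_hot_label(votes)
print(soft)
print(hard)
print(l1_distance(soft, hard))
```

For an ambiguous image, the one-hot target discards the minority votes, while the soft target keeps them, so a classifier trained against the soft target is penalized for being overconfident in the majority class.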
RESULTS: Agreement with CAFE is weak for unfiltered crowd workers, but the filtered crowd agrees with the CAFE labels 100% of the time for happy, neutral, sad, and "fear + surprise", and 88.8% of the time for "anger + disgust". Although the F1-score of the one-hot encoded classifier is much higher (94.33% vs. 78.68%) with respect to the ground truth CAFE labels, the output probability vector of the crowd-trained classifier more closely resembles the distribution of human labels (t=3.2827, p=0.0014).
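As a rough illustration of the comparison behind the reported t value, the sketch below computes a two-sample t statistic over two sets of per-image L1 distances. The abstract does not say whether the pooled-variance (Student) or Welch form was used; the pooled form is assumed here, and the distance values are synthetic placeholders, not data from the study.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample independent t statistic with pooled variance (an assumed variant)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err

# Synthetic per-image L1 distances to the human label distribution (illustrative only).
one_hot_distances = [0.9, 1.1, 0.8, 1.2, 1.0]
soft_distances = [0.5, 0.7, 0.4, 0.6, 0.8]

t_stat = pooled_t_statistic(one_hot_distances, soft_distances)
print(t_stat)
```

A positive statistic under this argument order means the one-hot classifier's outputs lie farther, on average, from the human label distribution. A p-value additionally requires the t distribution; `scipy.stats.ttest_ind` computes both if SciPy is available.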
CONCLUSIONS: For many applications of affective computing, reporting an emotion probability distribution that accounts for the subjectivity of human interpretation can be more useful than a single absolute label. Crowdsourcing, when combined with a sufficient filtering mechanism for selecting reliable crowd workers, is a feasible solution for acquiring soft-target labels.