Machine Learning for Identifying Emotional Expression in Text: Improving the Accuracy of Established Methods

Authors

Bantum Erin O, Elhadad Noémie, Owen Jason E, Zhang Shaodian, Golant Mitch, Buzaglo Joanne, Stephen Joanne, Giese-Davis Janine

Affiliations

University of Hawaii Cancer Center; Cancer Prevention & Control Program.

Columbia University; Biomedical Informatics.

Publication information

J Technol Behav Sci. 2017 Mar;2(1):21-27. doi: 10.1007/s41347-017-0015-5. Epub 2017 Apr 4.

DOI:10.1007/s41347-017-0015-5

PMID:32885036

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7467127/

Abstract

Expression of emotion has been linked to numerous critical and beneficial aspects of human functioning. Accurately capturing emotional expression in text grows in relevance as people continue to spend more time in online environments. The Linguistic Inquiry and Word Count (LIWC) is a commonly used program for identifying many constructs, including emotional expression. In an earlier study (Bantum & Owen, 2009), LIWC was shown to have good sensitivity yet poor positive predictive value. The goal of the current study was to create an automated machine learning technique to mimic manual coding. The sample included online support groups, cancer discussion boards, and transcripts from an expressive writing study, which resulted in 39,367 sentence-level coding decisions. Across the entire sample, the machine learning approach outperformed LIWC in all categories except sensitivity for negative emotion (LIWC sensitivity = .85; machine learning sensitivity = .41), although LIWC does not take into consideration prosocial emotions such as affection, interest, and validation. LIWC performed significantly better than the machine learning approach when the prosocial emotions were removed (p < .0001). The sample over-represented examples of emotion that fit into the overarching category of positive emotion. Further work is needed to create more effective machine learning features for codes that are thought to be important emotionally but were not well represented in the sample (e.g., frustration, contempt, and belligerence). Machine learning could be a fruitful method for continued exploration.
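
The two metrics the abstract relies on, sensitivity and positive predictive value, can be illustrated with a minimal Python sketch. It assumes that each sentence carries a binary manual code (emotion present or absent) and a binary automated prediction; the function name `sensitivity_and_ppv` and the toy labels are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def sensitivity_and_ppv(gold: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, positive predictive value) for binary sentence codes.

    gold      -- manual codes, True if a human coder marked the emotion
    predicted -- automated codes (for example from a lexicon or a classifier)
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g and not p)
    false_pos = sum(1 for g, p in zip(gold, predicted) if not g and p)

    # Sensitivity: share of manually coded emotion sentences that were detected.
    sensitivity = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else float("nan")
    # PPV: share of automated detections that the manual coding confirms.
    ppv = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else float("nan")
    return sensitivity, ppv


# Toy labels invented for illustration only (not study data).
gold = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, True, True, False, False]
sens, ppv = sensitivity_and_ppv(gold, predicted)
print(f"sensitivity={sens:.2f}, ppv={ppv:.2f}")
```

In this toy case the automated coder finds most of the true emotion sentences but also flags several neutral ones, which is the pattern of good sensitivity with poor positive predictive value that the abstract attributes to LIWC in the earlier study.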
