Frey William R, Patton Desmond U, Gaskell Michael B, McGregor Kyle A
Columbia University, New York City, NY, USA.
New York University Langone Health, New York City, NY, USA.
Soc Sci Comput Rev. 2020 Feb;38(1):42-56. doi: 10.1177/0894439318788314. Epub 2018 Jul 18.
Mining social media data for studying the human condition has created new and unique challenges. When analyzing social media data from marginalized communities, algorithms lack the ability to accurately interpret off-line context, which may lead to dangerous assumptions about and implications for marginalized communities. To combat this challenge, we hired formerly gang-involved young people as domain experts for contextualizing social media data in order to create inclusive, community-informed algorithms. Utilizing data from the Gang Intervention and Computer Science Project (a comprehensive analysis of Twitter data from gang-involved youth in Chicago), we describe the process of involving formerly gang-involved young people in developing a new part-of-speech tagger and content classifier for a prototype natural language processing system that detects aggression and loss in Twitter data. We argue that involving young people as domain experts leads to more robust understandings of context, including localized language, culture, and events. These insights could change how data scientists approach the development of corpora and algorithms that affect people in marginalized communities and whom to involve in that process. We offer a contextually driven interdisciplinary approach between social work and data science that integrates domain insights into the training of qualitative annotators and the production of algorithms for positive social impact.