Department of Information Systems and Cyber Security, University of Texas at San Antonio, San Antonio, Texas, USA.
J Am Med Inform Assoc. 2021 Mar 18;28(4):839-849. doi: 10.1093/jamia/ocaa326.
Machine learning is used to understand and track influenza-related content on social media. Because these systems are used at scale, they have the potential to adversely impact the people they are built to help. In this study, we explore the biases of different machine learning methods for the specific task of detecting influenza-related content. We compare the performance of each model on tweets written in Standard American English (SAE) vs African American English (AAE).
Two influenza-related datasets are used to train 3 text classification models (support vector machine, convolutional neural network, bidirectional long short-term memory) with different feature sets. The datasets match real-world scenarios in which there is a large imbalance between SAE and AAE examples: in both datasets, AAE examples make up only 2% to 5% of each class. We also evaluate each model's performance using a balanced dataset created via undersampling.
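As a rough illustration of balancing by undersampling, the sketch below randomly downsamples dialect groups within each class until they match the rarest dialect in that class. The abstract does not describe the exact procedure used, so the per-class strategy, the function name `undersample_dialects`, the record fields, and the toy counts are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def undersample_dialects(examples, seed=0):
    """Within each class label, randomly downsample every dialect group
    to the size of the smallest dialect group for that label.

    Hypothetical helper: the record layout ({"dialect", "label"}) is an
    assumption, not the format used in the study.
    """
    rng = random.Random(seed)
    by_label = defaultdict(lambda: defaultdict(list))
    for ex in examples:
        by_label[ex["label"]][ex["dialect"]].append(ex)

    balanced = []
    for dialect_buckets in by_label.values():
        n = min(len(items) for items in dialect_buckets.values())
        for items in dialect_buckets.values():
            balanced.extend(rng.sample(items, n))
    rng.shuffle(balanced)
    return balanced


# Toy data (made up): AAE is 5% of the positive class, 2% of the negative class.
toy = (
    [{"id": i, "dialect": "SAE", "label": 1} for i in range(95)]
    + [{"id": i, "dialect": "AAE", "label": 1} for i in range(95, 100)]
    + [{"id": i, "dialect": "SAE", "label": 0} for i in range(100, 198)]
    + [{"id": i, "dialect": "AAE", "label": 0} for i in range(198, 200)]
)

balanced = undersample_dialects(toy)
counts = Counter((ex["dialect"], ex["label"]) for ex in balanced)
print(counts)
```

Undersampling this way discards most of the majority-dialect examples, so the balanced set is far smaller than the original; that trade-off is inherent to the technique rather than to this particular sketch.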
We find that all of the tested machine learning methods are biased on both datasets. The difference in false positive rates between SAE and AAE examples ranges from 0.01 to 0.35. The difference in false negative rates ranges from 0.01 to 0.23. We also find that the neural network methods generally produce more unfair results than the linear support vector machine on the chosen datasets.
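The false positive and false negative rate differences reported above can be computed per dialect group from a model's predictions. The following is a minimal sketch, assuming binary labels (1 = influenza-related) and an absolute difference between groups; the abstract does not state whether the reported differences are signed, and the function names and toy predictions are hypothetical.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr


def rate_gaps(y_true, y_pred, groups, a="SAE", b="AAE"):
    """Absolute FPR and FNR differences between dialect groups a and b."""
    rates = {}
    for g in (a, b):
        idx = [i for i, name in enumerate(groups) if name == g]
        rates[g] = error_rates([y_true[i] for i in idx],
                               [y_pred[i] for i in idx])
    return {
        "fpr_gap": abs(rates[a][0] - rates[b][0]),
        "fnr_gap": abs(rates[a][1] - rates[b][1]),
    }


# Toy predictions (made up, not data from the study).
groups = ["SAE"] * 8 + ["AAE"] * 4
y_true = [0, 0, 0, 0, 1, 1, 1, 1] + [0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0] + [1, 0, 0, 0]

gaps = rate_gaps(y_true, y_pred, groups)
print(gaps)
```

A gap of 0 on both rates would mean the classifier makes each kind of error equally often for the two groups; larger gaps indicate that one group bears more of the errors.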
The models that result in the most unfair predictions may vary from dataset to dataset. Practitioners should be aware of the potential harms related to applying machine learning to health-related social media data. At a minimum, we recommend evaluating fairness along with traditional evaluation metrics.