Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, Alex Hanna
Department of Linguistics, University of Washington, Seattle, WA, USA.
Mozilla Foundation, Mountain View, CA, USA.
Patterns (N Y). 2021 Nov 12;2(11):100336. doi: 10.1016/j.patter.2021.100336.
In this work, we survey a breadth of literature that has revealed the limitations of predominant practices for dataset collection and use in the field of machine learning. We cover studies that critically review the design and development of datasets with a focus on negative societal impacts and poor outcomes for system performance. We also cover approaches to filtering and augmenting data and modeling techniques aimed at mitigating the impact of bias in datasets. Finally, we discuss works that have studied data practices, cultures, and disciplinary norms, and consider their implications for the legal, ethical, and functional challenges the field continues to face. Based on these findings, we advocate for the use of both qualitative and quantitative approaches to more carefully document and analyze datasets during the creation and usage phases.