Data and its (dis)contents: A survey of dataset development and use in machine learning research.

Authors

Paullada Amandalynne, Raji Inioluwa Deborah, Bender Emily M, Denton Emily, Hanna Alex

Affiliations

Department of Linguistics, University of Washington, Seattle, WA, USA.

Mozilla Foundation, Mountain View, CA, USA.

Publication information

Patterns (N Y). 2021 Nov 12;2(11):100336. doi: 10.1016/j.patter.2021.100336.

Abstract

In this work, we survey a breadth of literature that has revealed the limitations of predominant practices for dataset collection and use in the field of machine learning. We cover studies that critically review the design and development of datasets with a focus on negative societal impacts and poor outcomes for system performance. We also cover approaches to filtering and augmenting data and modeling techniques aimed at mitigating the impact of bias in datasets. Finally, we discuss works that have studied data practices, cultures, and disciplinary norms and discuss implications for the legal, ethical, and functional challenges the field continues to face. Based on these findings, we advocate for the use of both qualitative and quantitative approaches to more carefully document and analyze datasets during the creation and usage phases.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a0f/8600147/6485db44eb65/gr1.jpg
