Implicit data crimes: Machine learning bias arising from misuse of public data.

Authors

Shimron Efrat, Tamir Jonathan I, Wang Ke, Lustig Michael

Affiliations

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712.

Publication information

Proc Natl Acad Sci U S A. 2022 Mar 29;119(13):e2117203119. doi: 10.1073/pnas.2117203119. Epub 2022 Mar 21.

Abstract

Significance

Public databases are an important resource for machine learning research, but their growing availability sometimes leads to "off-label" usage, where data published for one task are used for another. This work reveals that such off-label usage could lead to biased, overly optimistic results of machine-learning algorithms. The underlying cause is that public data are processed with hidden processing pipelines that alter the data features. Here we study three well-known algorithms developed for image reconstruction from magnetic resonance imaging measurements and show they could produce biased results with up to 48% artificial improvement when applied to public databases. We relate to the publication of such results as implicit "data crimes" to raise community awareness of this growing big data problem.
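The mechanism described above, a hidden processing step that changes the data and thereby flatters a reconstruction benchmark, can be illustrated with a toy example. The sketch below is not the authors' experiment and does not use any of the three algorithms they studied. It assumes a hypothetical pipeline (`hidden_pipeline`) that silently discards high frequencies, a 1D random complex signal standing in for raw measurements, and a plain zero-filled inverse FFT as the "reconstruction". All function names and parameter values are illustrative assumptions.

```python
import numpy as np

def nrmse(reference, estimate):
    """Normalized root-mean-square error between two signals."""
    return np.linalg.norm(estimate - reference) / np.linalg.norm(reference)

def center_mask(n, keep_fraction):
    """Boolean mask keeping the central (low-frequency) part of a centered spectrum."""
    keep = int(n * keep_fraction)
    start = (n - keep) // 2
    mask = np.zeros(n, dtype=bool)
    mask[start:start + keep] = True
    return mask

def hidden_pipeline(signal, keep_fraction):
    """Hypothetical preprocessing (an assumption): silently drop high frequencies."""
    spectrum = np.fft.fftshift(np.fft.fft(signal))
    spectrum[~center_mask(signal.size, keep_fraction)] = 0
    return np.fft.ifft(np.fft.ifftshift(spectrum))

def zero_filled_recon_error(signal, sampling_fraction):
    """Retrospectively subsample the spectrum, reconstruct by zero-filling, and
    score the result against the signal that was given as 'ground truth'."""
    spectrum = np.fft.fftshift(np.fft.fft(signal))
    spectrum[~center_mask(signal.size, sampling_fraction)] = 0
    recon = np.fft.ifft(np.fft.ifftshift(spectrum))
    return nrmse(signal, recon)

rng = np.random.default_rng(0)
# Stand-in for raw measurements: complex white noise has energy at all frequencies.
raw = rng.standard_normal(1024) + 1j * rng.standard_normal(1024)
# Stand-in for a public dataset that went through an undocumented pipeline.
processed = hidden_pipeline(raw, keep_fraction=0.6)

raw_error = zero_filled_recon_error(raw, sampling_fraction=0.4)
processed_error = zero_filled_recon_error(processed, sampling_fraction=0.4)
print(f"NRMSE on raw data:       {raw_error:.3f}")
print(f"NRMSE on processed data: {processed_error:.3f}")
```

The same reconstruction and the same sampling pattern score better on the processed data only because the pipeline already removed much of what the subsampling would have lost. The size of the gap here depends entirely on the toy parameters and says nothing about the magnitudes reported in the paper.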

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e24/9060447/60ec6b7cb23d/pnas.2117203119fig01.jpg
