Implicit data crimes: Machine learning bias arising from misuse of public data.

Author information

Shimron Efrat, Tamir Jonathan I, Wang Ke, Lustig Michael

Affiliations

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712.

Publication information

Proc Natl Acad Sci U S A. 2022 Mar 29;119(13):e2117203119. doi: 10.1073/pnas.2117203119. Epub 2022 Mar 21.

DOI:10.1073/pnas.2117203119

PMID:35312366

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9060447/

Abstract

Significance

Public databases are an important resource for machine learning research, but their growing availability sometimes leads to "off-label" usage, where data published for one task are used for another. This work reveals that such off-label usage could lead to biased, overly optimistic results of machine-learning algorithms. The underlying cause is that public data are processed with hidden processing pipelines that alter the data features. Here we study three well-known algorithms developed for image reconstruction from magnetic resonance imaging measurements and show they could produce biased results with up to 48% artificial improvement when applied to public databases. We relate to the publication of such results as implicit "data crimes" to raise community awareness of this growing big data problem.
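The mechanism described above can be illustrated with a toy sketch. The abstract does not say which hidden processing steps are involved, so the code below assumes one hypothetical pipeline, zero-padding in k-space (Fourier-domain interpolation), applied to a 1D random signal rather than real MRI data. The sampling scheme (keeping only the central half of k-space) and all function names are illustrative assumptions, not the authors' method.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stands in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse DFT, normalized by 1/N."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def signed_freq(k, n):
    """Map DFT bin k to its signed frequency in [-n/2, n/2)."""
    return k if k < n // 2 else k - n


def zero_pad_kspace(X, m):
    """ASSUMED hidden pipeline: embed k-space X in a larger, zero-filled grid."""
    n = len(X)
    out = [0j] * m
    for k in range(n):
        out[signed_freq(k, n) % m] = X[k]
    return out


def keep_center(X, fraction):
    """Hypothetical retrospective subsampling: keep only the central bins."""
    n = len(X)
    half = int(n * fraction) // 2
    return [X[k] if -half <= signed_freq(k, n) < half else 0j for k in range(n)]


def relative_error(reference, estimate):
    """Relative l2 error between a reference and an estimate."""
    num = math.sqrt(sum(abs(r - e) ** 2 for r, e in zip(reference, estimate)))
    den = math.sqrt(sum(abs(r) ** 2 for r in reference))
    return num / den


def zero_filled_error(image, fraction=0.5):
    """Subsample the image's k-space, reconstruct by zero-filling, score it."""
    recon = idft(keep_center(dft(image), fraction))
    return relative_error(image, recon)


rng = random.Random(0)
raw_image = [rng.gauss(0.0, 1.0) for _ in range(32)]          # toy "raw" data
processed_image = idft(zero_pad_kspace(dft(raw_image), 64))   # toy "public" data

err_raw = zero_filled_error(raw_image)
err_processed = zero_filled_error(processed_image)
print(f"raw: {err_raw:.3f}, processed: {err_processed:.2e}")
```

In this sketch the processed signal has no energy in the outer half of its k-space, so discarding that half costs nothing and the reconstruction error is essentially zero, while the same experiment on the unprocessed signal loses real high-frequency content. A study that treated the processed signal as raw data would therefore report an overly optimistic error.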

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e24/9060447/60ec6b7cb23d/pnas.2117203119fig01.jpg

Similar articles

Implicit data crimes: Machine learning bias arising from misuse of public data.

Proc Natl Acad Sci U S A. 2022 Mar 29;119(13):e2117203119. doi: 10.1073/pnas.2117203119. Epub 2022 Mar 21.

Machine learning in Magnetic Resonance Imaging: Image reconstruction.

Phys Med. 2021 Mar;83:79-87. doi: 10.1016/j.ejmp.2021.02.020. Epub 2021 Mar 13.

Evolution and impact of bias in human and machine learning algorithm interaction.

PLoS One. 2020 Aug 13;15(8):e0235502. doi: 10.1371/journal.pone.0235502. eCollection 2020.

Image Reconstruction is a New Frontier of Machine Learning.

IEEE Trans Med Imaging. 2018 Jun;37(6):1289-1296. doi: 10.1109/TMI.2018.2833635.

Involvement of Machine Learning Tools in Healthcare Decision Making.

J Healthc Eng. 2021 Jan 27;2021:6679512. doi: 10.1155/2021/6679512. eCollection 2021.

Detection of Diseases Using Machine Learning Image Recognition Technology in Artificial Intelligence.

Comput Intell Neurosci. 2022 Apr 13;2022:5658641. doi: 10.1155/2022/5658641. eCollection 2022.

Learning-based 3T brain MRI segmentation with guidance from 7T MRI labeling.

Med Phys. 2016 Dec;43(12):6588-6597. doi: 10.1118/1.4967487.

Extreme Learning Machine (ELM)-Based Classification of Benign and Malignant Cells in Breast Cancer.

Med Sci Monit. 2018 Sep 17;24:6537-6543. doi: 10.12659/MSM.910520.

Big Data Approaches to Phenotyping Acute Ischemic Stroke Using Automated Lesion Segmentation of Multi-Center Magnetic Resonance Imaging Data.

Stroke. 2019 Jul;50(7):1734-1741. doi: 10.1161/STROKEAHA.119.025373. Epub 2019 Jun 10.

Inpainted Image Reconstruction Using an Extended Hopfield Neural Network Based Machine Learning System.

Sensors (Basel). 2022 Jan 21;22(3):813. doi: 10.3390/s22030813.

Cited by

Machine learning to evaluate the effects of non-clinical social determinant features in predicting colorectal Cancer mortality in a medically underserved Appalachian population.

Sci Rep. 2025 Jul 16;15(1):25781. doi: 10.1038/s41598-025-11074-y.

Diffusion probabilistic generative models for accelerated, in-NICU permanent magnet neonatal MRI.

Magn Reson Med. 2025 Oct;94(4):1546-1562. doi: 10.1002/mrm.30585. Epub 2025 Jun 17.

Rethinking MRI as a measurement device through modular and portable pipelines.

MAGMA. 2025 Apr 24. doi: 10.1007/s10334-025-01245-3.

MRI acquisition and reconstruction cookbook: recipes for reproducibility, served with real-world flavour.

MAGMA. 2025 Mar 6. doi: 10.1007/s10334-025-01236-4.

The scientific evidence of commercial AI products for MRI acceleration: a systematic review.

Eur Radiol. 2025 Feb 19. doi: 10.1007/s00330-025-11423-5.

FastMRI Breast: A Publicly Available Radial k-Space Dataset of Breast Dynamic Contrast-enhanced MRI.

Radiol Artif Intell. 2025 Jan;7(1):e240345. doi: 10.1148/ryai.240345.

Multi-task magnetic resonance imaging reconstruction using meta-learning.

Magn Reson Imaging. 2025 Feb;116:110278. doi: 10.1016/j.mri.2024.110278. Epub 2024 Nov 22.

Assessing personalized molecular portraits underlying endothelial-to-mesenchymal transition within pulmonary arterial hypertension.

Mol Med. 2024 Oct 26;30(1):189. doi: 10.1186/s10020-024-00963-z.

The intelligent imaging revolution: artificial intelligence in MRI and MRS acquisition and reconstruction.

MAGMA. 2024 Jul;37(3):329-333. doi: 10.1007/s10334-024-01179-2. Epub 2024 Jun 20.

Accelerated MRI reconstructions via variational network and feature domain learning.

Sci Rep. 2024 May 14;14(1):10991. doi: 10.1038/s41598-024-59705-0.

References cited in this article

Solving Inverse Problems With Deep Neural Networks - Robustness Included?

IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):1119-1134. doi: 10.1109/TPAMI.2022.3148324. Epub 2022 Dec 5.

Evaluation on the generalization of a learned convolutional neural network for MRI reconstruction.

Magn Reson Imaging. 2022 Apr;87:38-46. doi: 10.1016/j.mri.2021.12.003. Epub 2021 Dec 27.

A multispeaker dataset of raw and reconstructed speech production real-time MRI video and 3D volumetric images.

Sci Data. 2021 Jul 20;8(1):187. doi: 10.1038/s41597-021-00976-x.

Results of the 2020 fastMRI Challenge for Machine Learning MR Image Reconstruction.

IEEE Trans Med Imaging. 2021 Sep;40(9):2306-2317. doi: 10.1109/TMI.2021.3075856. Epub 2021 Aug 31.

Boosting the signal-to-noise of low-field MRI with deep learning image reconstruction.

Sci Rep. 2021 Apr 15;11(1):8248. doi: 10.1038/s41598-021-87482-7.

Deep-Learning Methods for Parallel Magnetic Resonance Imaging Reconstruction: A Survey of the Current Approaches, Trends, and Issues.

IEEE Signal Process Mag. 2020 Jan;37(1):128-140. doi: 10.1109/MSP.2019.2950640. Epub 2020 Jan 20.

On instabilities of deep learning in image reconstruction and the potential costs of AI.

Proc Natl Acad Sci U S A. 2020 Dec 1;117(48):30088-30095. doi: 10.1073/pnas.1907377117. Epub 2020 May 11.

Optimization Methods for Magnetic Resonance Image Reconstruction: Key Models and Optimization Algorithms.

IEEE Signal Process Mag. 2020 Jan;37(1):33-40. doi: 10.1109/MSP.2019.2943645. Epub 2020 Jan 17.

Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal.

BMJ. 2020 Apr 7;369:m1328. doi: 10.1136/bmj.m1328.

Image Reconstruction: From Sparsity to Data-adaptive Methods and Machine Learning.

Proc IEEE Inst Electr Electron Eng. 2020 Jan;108(1):86-109. doi: 10.1109/JPROC.2019.2936204. Epub 2019 Sep 19.
