IEEE Trans Med Imaging. 2023 Jul;42(7):2081-2090. doi: 10.1109/TMI.2022.3220706. Epub 2023 Jun 30.
Dataset auditing for machine learning (ML) models is a method for evaluating whether a given dataset was used to train a model. In a Federated Learning setting, where multiple institutions collaboratively train a model on their decentralized private datasets, dataset auditing can facilitate the enforcement of regulations that set rules for preserving privacy and also allow users to revoke authorization and remove their data from collaboratively trained models. This paper first proposes a set of requirements for a practical dataset auditing method, and then presents a novel dataset auditing method called Ensembled Membership Auditing (EMA). Its key idea is to leverage previously proposed Membership Inference Attack methods and to aggregate data-wise membership scores using statistical testing to audit a dataset for an ML model. We have experimentally evaluated the proposed approach on benchmark datasets, as well as on 4 X-ray datasets (CBIS-DDSM, COVIDx, Child-XRay, and CXR-NIH) and 3 dermatology datasets (DERM7pt, HAM10000, and PAD-UFES-20). Our results show that EMA meets the requirements substantially better than the previous state-of-the-art method. Our code is at: https://github.com/Hazelsuko07/EMA.
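The key idea of aggregating per-sample membership scores with a statistical test can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above). It assumes that a membership score is the model's confidence on the true label, that a calibration set of known non-members is available, and it uses a one-sided permutation test as a stand-in for the statistical test. The names `audit_dataset`, `member_like`, and `non_member_calib` are hypothetical, and the scores are synthetic.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def audit_dataset(query_scores, calib_scores, n_perm=2000, alpha=0.05, seed=0):
    """One-sided permutation test on per-sample membership scores.

    Null hypothesis: the query dataset's scores are no higher on average
    than those of known non-members (the calibration set).
    Returns (p_value, flagged), where flagged=True suggests the query
    dataset was used in training.
    """
    rng = random.Random(seed)
    observed = _mean(query_scores) - _mean(calib_scores)
    pooled = list(query_scores) + list(calib_scores)  # copy; inputs stay untouched
    n_query = len(query_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = _mean(pooled[:n_query]) - _mean(pooled[n_query:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return p_value, p_value < alpha


def _clip01(x):
    return min(1.0, max(0.0, x))


# Synthetic scores (assumption): training members tend to receive higher
# confidence than non-members, which is the signal membership inference exploits.
_rng = random.Random(42)
member_like = [_clip01(_rng.gauss(0.95, 0.03)) for _ in range(200)]
non_member_calib = [_clip01(_rng.gauss(0.70, 0.15)) for _ in range(200)]

p_value, flagged = audit_dataset(member_like, non_member_calib)
print(f"p-value: {p_value:.4f}, flagged as used in training: {flagged}")
```

Testing at the dataset level, rather than thresholding each sample independently, lets many weak per-sample signals add up to a single decision with a controlled false-positive rate `alpha`.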