Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA.
Pac Symp Biocomput. 2021;26:26-37.
Machine learning is a powerful tool for modeling massive genomic data, but genome privacy is a growing concern. Studies have shown that not only the raw data but also the trained model can potentially infringe genome privacy. One example is the membership inference attack (MIA), by which an adversary can determine whether a specific record was included in the training dataset of the target model. Differential privacy (DP) has been used to defend against MIA with rigorous privacy guarantees by perturbing model weights. In this paper, we investigate the vulnerability of machine learning models to MIA on genomic data and evaluate the effectiveness of DP as a defense mechanism. We consider two widely used machine learning models, namely Lasso and the convolutional neural network (CNN), as target models. We study the trade-off between the defense power against MIA and the prediction accuracy of the target model under various privacy settings of DP. Our results show that the relationship between the privacy budget and target model accuracy can be modeled as a log-like curve; thus, a smaller privacy budget provides a stronger privacy guarantee at the cost of a greater loss in model accuracy. We also investigate the effect of model sparsity on model vulnerability to MIA. Our results demonstrate that, in addition to preventing overfitting, model sparsity can work together with DP to significantly mitigate the risk of MIA.
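To make the two mechanisms in the abstract concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: output perturbation adds Laplace noise to trained weights with scale proportional to sensitivity/ε (so a smaller privacy budget ε means more noise and lower accuracy), and a naive confidence-thresholding MIA flags high-confidence predictions as likely training members. The function names, the unit sensitivity, and the threshold value are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_weights(weights, epsilon, sensitivity=1.0):
    """Output-perturbation DP sketch: add Laplace noise with scale
    sensitivity / epsilon to each trained weight. A smaller epsilon
    (stronger privacy) yields larger noise and thus lower accuracy.
    Unit sensitivity here is an assumption for illustration."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon,
                        size=np.asarray(weights).shape)
    return weights + noise

def confidence_mia(confidences, threshold=0.9):
    """Naive membership inference sketch: overfit models tend to be
    more confident on training records, so records whose prediction
    confidence exceeds a (hypothetical) threshold are guessed to be
    training-set members."""
    return np.asarray(confidences) > threshold

# Smaller privacy budget => larger perturbation of the same weights.
w = np.zeros(1000)
noisy_strong_privacy = perturb_weights(w, epsilon=0.1)   # more noise
noisy_weak_privacy = perturb_weights(w, epsilon=10.0)    # less noise
```

With many weights, the mean absolute perturbation under ε = 0.1 is roughly 100× that under ε = 10, matching the abstract's log-like privacy/accuracy trade-off: tightening ε buys privacy by degrading the released model.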