Generalized genomic data sharing for differentially private federated learning.

Author information

Aziz Md Momin Al, Anjum Md Monowar, Mohammed Noman, Jiang Xiaoqian

Affiliation

Computer Science, University of Manitoba, 66 Chancellors Circle, Winnipeg R3T 2N2, Manitoba, Canada.

Publication information

J Biomed Inform. 2022 Aug;132:104113. doi: 10.1016/j.jbi.2022.104113. Epub 2022 Jun 9.

DOI:10.1016/j.jbi.2022.104113

PMID:35690350

Abstract

The success of machine learning (ML) methods is largely attributed to the quality and quantity of the available data, which may be spread across multiple owners. Federated learning (FL) over such distributed datasets often provides a reliable way to gain valuable insight. Genomic data have also proven to be sensitive, so additional safety mechanisms are required before any sharing or ML operation. We propose a generalized gene expression data sharing method that uses a differentially private mechanism. Because of the large number of genes available, the data dimension is also reduced to accommodate smaller privacy budgets, as we use an exponential mechanism to create a private histogram from the numeric expression data. The output histogram can be used in any federated machine learning setting with multiple data owners. The proposed solution was submitted to the genomic data security and privacy competition iDash 2020, where it ranked third among 55 teams. We extended the proposed solution and experimented with two different machine learning algorithms under different settings. The experimental results show that training a model takes around 8 s while achieving an AUC of 0.89 with a privacy budget of only 5. The paper outlines a method for sharing gene expression data for federated learning using a privacy-preserving mechanism. The different experimental settings and the recent competition results show the efficacy of the method, which can be further extended to other genomic datasets and machine learning algorithms.
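The abstract names the exponential mechanism as the tool for building a private histogram but gives no details, so the following is only a minimal sketch of the generic mechanism, not the authors' procedure. It assumes a single gene, fixed bin edges, and a utility equal to the bin count (sensitivity 1, since one sample changes any count by at most 1). The function names `exponential_mechanism` and `private_modal_bin`, the toy expression values, and the bin layout are all hypothetical; the budget of 5 is reused from the abstract purely for illustration.

```python
import numpy as np

def exponential_mechanism(utilities, epsilon, sensitivity=1.0, rng=None):
    """Sample an index with probability proportional to exp(eps * u / (2 * sensitivity))."""
    rng = np.random.default_rng() if rng is None else rng
    utilities = np.asarray(utilities, dtype=float)
    # Subtracting the maximum leaves the distribution unchanged but avoids overflow.
    scores = epsilon * (utilities - utilities.max()) / (2.0 * sensitivity)
    probs = np.exp(scores)
    probs /= probs.sum()
    return int(rng.choice(len(utilities), p=probs)), probs

def private_modal_bin(values, bin_edges, epsilon, rng=None):
    """Privately pick a histogram bin for one gene, using bin counts as utilities.

    Assumption: utility = number of samples in the bin, so sensitivity is 1.
    """
    counts, _ = np.histogram(values, bins=bin_edges)
    return exponential_mechanism(counts, epsilon, sensitivity=1.0, rng=rng)

# Toy data (hypothetical): expression values of one gene across ten samples.
rng = np.random.default_rng(0)
expression = np.array([0.2, 0.4, 2.1, 2.3, 2.5, 2.7, 2.9, 4.8, 7.5, 9.1])
edges = np.linspace(0.0, 10.0, 6)  # five equal-width bins over [0, 10]
chosen, probs = private_modal_bin(expression, edges, epsilon=5.0, rng=rng)
print("chosen bin:", chosen)
print("selection probabilities:", probs.round(4))
```

In this sketch a larger privacy budget concentrates the selection probability on the most populated bin, while a smaller one flattens it toward uniform. Replacing each gene's numeric values with a privately selected bin is one plausible way such a mechanism could shrink the data shared between owners, but the abstract does not confirm that this is how the published method works.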
