Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland.
Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland.
Bioinformatics. 2019 Jul 15;35(14):i218-i224. doi: 10.1093/bioinformatics/btz373.
Human genomic datasets often contain sensitive information that limits the use and sharing of the data. In particular, simple anonymization strategies fail to provide a sufficient level of protection for genomic data, because the data are inherently identifiable. Differentially private machine learning can help by guaranteeing that the published results do not leak too much information about any individual data point. Recent research has reached promising results on differentially private drug sensitivity prediction using gene expression data. Differentially private learning with genomic data is challenging because it is more difficult to guarantee privacy in high dimensions. Dimensionality reduction can help, but if the dimensionality reduction mapping is learned from the data, then it needs to be differentially private too, which can carry a significant privacy cost. Furthermore, the selection of any hyperparameters (such as the target dimensionality) also needs to avoid leaking private information.
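The differential privacy guarantee mentioned above can be illustrated with the classic Laplace mechanism for releasing a mean: adding noise calibrated to how much one record can change the result bounds the information leaked about any individual. This is a minimal sketch; the data, bounds and privacy parameter are illustrative and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy bounded data: each record lies in [0, 1].
values = rng.uniform(0.0, 1.0, size=100)

# Changing one record moves the mean by at most (1 - 0) / n.
sensitivity = 1.0 / len(values)
epsilon = 1.0  # privacy budget (illustrative)

# Laplace mechanism: the released value is epsilon-differentially private.
noisy_mean = values.mean() + rng.laplace(scale=sensitivity / epsilon)
```

With 100 records the noise scale is only 0.01, but for a high-dimensional release the noise must cover every coordinate, which is why high dimensions make differentially private learning hard.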
We study an approach that uses a large public dataset of a similar type to learn a compact representation for differentially private learning. We compare three representation learning methods: variational autoencoders, principal component analysis and random projection. We solve two machine learning tasks on gene expression of cancer cell lines: cancer type classification and drug sensitivity prediction. The experiments demonstrate a significant benefit from all representation learning methods, with variational autoencoders providing the most accurate predictions most often. Our results significantly improve on the previous state of the art in the accuracy of differentially private drug sensitivity prediction.
Code used in the experiments is available at https://github.com/DPBayes/dp-representation-transfer.
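The overall scheme described in the abstract can be sketched as three steps: learn a low-dimensional mapping from public data only (so the mapping costs no privacy budget on the private data), project the private data through it, and then train a differentially private predictor in the reduced space. The sketch below uses PCA as the representation and sufficient-statistic perturbation for a ridge regressor; the data sizes, clipping bound, noise scales and privacy split are illustrative assumptions, not the paper's actual algorithm (see the repository above for that).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for a large public and a small private
# gene-expression matrix (shapes are illustrative).
X_public = rng.normal(size=(1000, 200))
X_private = rng.normal(size=(50, 200))
y_private = rng.normal(size=50)

# 1) Learn a compact representation from PUBLIC data only (PCA via SVD).
k = 10
mu = X_public.mean(axis=0)
_, _, Vt = np.linalg.svd(X_public - mu, full_matrices=False)
W = Vt[:k].T  # (200, k) projection learned without touching private data

# 2) Project the private data into the k-dimensional space,
#    then clip each row so per-record sensitivity is bounded.
Z = (X_private - mu) @ W
clip = 1.0
Z = Z / np.maximum(1.0, np.linalg.norm(Z, axis=1, keepdims=True) / clip)
y = np.clip(y_private, -clip, clip)

# 3) Differentially private ridge regression via perturbed sufficient
#    statistics Z^T Z and Z^T y. Each entry changes by at most clip**2
#    when one record changes; the noise scales below are simplified for
#    illustration, and a rigorous analysis would bound the joint
#    sensitivity of all released entries.
epsilon = 1.0
A = Z.T @ Z + rng.laplace(scale=clip**2 / (epsilon / 2), size=(k, k))
b = Z.T @ y + rng.laplace(scale=clip**2 / (epsilon / 2), size=k)
lam = 1.0  # ridge regularizer, also keeps the noisy system well-posed
w = np.linalg.solve(A + lam * np.eye(k), b)
```

The key point of the design is step 1: because the projection `W` depends only on public data, the privacy analysis needs to cover only the low-dimensional training in step 3, which requires far less noise than privatizing the full 200-dimensional problem.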