Department of Mathematics, University of Tulsa, Tulsa, OK 74104, USA.
Laureate Institute for Brain Research, Tulsa, OK 74136, USA.
Bioinformatics. 2017 Sep 15;33(18):2906-2913. doi: 10.1093/bioinformatics/btx298.
Classification of individuals into disease or clinical categories from high-dimensional biological data with low prediction error is an important challenge of statistical learning in bioinformatics. Feature selection can improve classification accuracy but must be incorporated carefully into cross-validation to avoid overfitting. Recently, feature selection methods based on differential privacy, such as differentially private random forests and reusable holdout sets, have been proposed. However, for domains such as bioinformatics, where the number of features is much larger than the number of observations (p ≫ n), these differential privacy methods are susceptible to overfitting.
We introduce private Evaporative Cooling, a stochastic privacy-preserving machine learning algorithm that uses Relief-F for feature selection and random forest for privacy-preserving classification, and that also prevents overfitting. We relate the privacy-preserving threshold mechanism to a thermodynamic Maxwell-Boltzmann distribution, where the temperature represents the privacy threshold. We use the thermal statistical physics concept of Evaporative Cooling of atomic gases to perform backward stepwise privacy-preserving feature selection.
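The Maxwell-Boltzmann analogy can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed importance scores, and the geometric cooling schedule are all assumptions made for illustration. The abstract states only that the temperature plays the role of the privacy threshold and that features are removed by backward stepwise selection. Here each remaining feature "evaporates" with probability proportional to exp(-importance / T), so low-importance features are the most likely to be removed.

```python
import numpy as np


def boltzmann_removal_probs(importance, temperature):
    """Removal probability per feature; lower importance means higher probability."""
    importance = np.asarray(importance, dtype=float)
    # Shift by the minimum so the largest exponent is 0 (numerical stability).
    weights = np.exp(-(importance - importance.min()) / temperature)
    return weights / weights.sum()


def evaporative_selection(importance, n_keep, temperature=0.1, cooling=0.95, rng=None):
    """Backward stepwise removal: drop one feature per step until n_keep remain.

    Assumption: importance scores are fixed inputs. A fuller version would
    recompute them (for example with Relief-F) after every removal.
    """
    rng = np.random.default_rng() if rng is None else rng
    importance = np.asarray(importance, dtype=float)
    remaining = list(range(len(importance)))
    while len(remaining) > n_keep:
        probs = boltzmann_removal_probs(importance[remaining], temperature)
        drop = int(rng.choice(len(remaining), p=probs))
        remaining.pop(drop)
        temperature *= cooling  # hypothetical cooling schedule
    return remaining


scores = [0.9, 0.1, 0.8, 0.2, 0.05]  # made-up importance scores
kept = evaporative_selection(scores, n_keep=2, temperature=1e-3,
                             rng=np.random.default_rng(0))
```

At a high temperature the removal probabilities approach uniform, so selection is noisier; at a low temperature the least important feature is removed almost deterministically.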
On simulated data with main effects and statistical interactions, we compare accuracies on holdout and validation sets for three privacy-preserving methods: the reusable holdout (thresholdout), the reusable holdout with random forest, and private Evaporative Cooling, which uses Relief-F feature selection and random forest classification. In simulations where interactions exist between attributes, private Evaporative Cooling provides higher classification accuracy without overfitting, as assessed on an independent validation set. In simulations without interactions, the reusable holdout with random forest and private Evaporative Cooling give comparable accuracies. We also apply these privacy methods to human brain resting-state fMRI data from a study of major depressive disorder.
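For readers unfamiliar with the reusable holdout, its core threshold mechanism can be sketched as follows. This is a simplified illustration rather than the code used in the study: the function name, default parameter values, and the single Laplace draw on the threshold are assumptions. The idea is that the holdout accuracy is revealed, with added noise, only when it differs from the training accuracy by more than a noisy threshold; otherwise the training accuracy is returned and the holdout set stays hidden.

```python
import numpy as np


def thresholdout(train_acc, holdout_acc, threshold=0.04, sigma=0.01, rng=None):
    """Simplified thresholdout query (parameter defaults are illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy_threshold = threshold + rng.laplace(0.0, 2.0 * sigma)
    if abs(train_acc - holdout_acc) < noisy_threshold:
        # Estimates agree: answer from the training set, leaking nothing new.
        return train_acc
    # Estimates disagree (possible overfitting): answer with a noisy holdout value.
    return holdout_acc + rng.laplace(0.0, sigma)


rng = np.random.default_rng(0)
agree = thresholdout(0.80, 0.79, sigma=1e-9, rng=rng)     # small gap
disagree = thresholdout(0.95, 0.70, sigma=1e-9, rng=rng)  # large gap
```

With a small gap the training accuracy is returned unchanged; with a large gap the returned value tracks the holdout accuracy instead.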
Code available at http://insilico.utulsa.edu/software/privateEC .
Supplementary data are available at Bioinformatics online.