Differential privacy-based evaporative cooling feature selection and classification with Relief-F and random forests.

Affiliations

Department of Mathematics, University of Tulsa, Tulsa, OK 74104, USA.

Laureate Institute for Brain Research, Tulsa, OK 74136, USA.

Publication information

Bioinformatics. 2017 Sep 15;33(18):2906-2913. doi: 10.1093/bioinformatics/btx298.

DOI:10.1093/bioinformatics/btx298

PMID:28472232

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5870708/

Abstract

MOTIVATION

Classifying individuals into disease or clinical categories from high-dimensional biological data with low prediction error is an important challenge of statistical learning in bioinformatics. Feature selection can improve classification accuracy but must be incorporated carefully into cross-validation to avoid overfitting. Recently, feature selection methods based on differential privacy, such as differentially private random forests and reusable holdout sets, have been proposed. However, in domains such as bioinformatics, where the number of features is much larger than the number of observations ($p \gg n$), these differential privacy methods are susceptible to overfitting.

METHODS

We introduce private Evaporative Cooling, a stochastic privacy-preserving machine learning algorithm that uses Relief-F for feature selection and random forest for privacy-preserving classification, and that also prevents overfitting. We relate the privacy-preserving threshold mechanism to a thermodynamic Maxwell-Boltzmann distribution, where the temperature represents the privacy threshold. We use the thermal statistical physics concept of Evaporative Cooling of atomic gases to perform backward stepwise privacy-preserving feature selection.
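The backward stepwise idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes Boltzmann-like removal weights of the form exp(-score / T), so that low-importance features are the most likely to "evaporate", and a geometric cooling schedule. The function names, the `cooling` parameter, and the fixed importance scores (standing in for Relief-F scores, which a real run would recompute at each step) are all hypothetical.

```python
import math
import random


def evaporation_probabilities(scores, temperature):
    """Boltzmann-like removal probabilities: lower importance, higher chance to go."""
    # Shift by the minimum score for numerical stability; the shift cancels
    # when the weights are normalized.
    low = min(scores.values())
    weights = {f: math.exp(-(s - low) / temperature) for f, s in scores.items()}
    total = sum(weights.values())
    return {f: w / total for f, w in weights.items()}


def evaporative_cooling(scores, n_keep, temperature=1.0, cooling=0.9, seed=0):
    """Backward stepwise selection: remove one feature per step, then cool."""
    rng = random.Random(seed)
    remaining = dict(scores)  # assumption: importance scores stay fixed
    while len(remaining) > n_keep:
        probs = evaporation_probabilities(remaining, temperature)
        features = list(probs)
        removed = rng.choices(features, weights=[probs[f] for f in features], k=1)[0]
        del remaining[removed]
        temperature *= cooling  # hypothetical geometric cooling schedule
    return sorted(remaining)


if __name__ == "__main__":
    toy_scores = {"gene_a": 0.9, "gene_b": 0.1, "gene_c": 0.5, "gene_d": 0.05}
    print(evaporative_cooling(toy_scores, n_keep=2))
```

At high temperature the removal is close to uniform; as the temperature drops, removal concentrates on the lowest-scoring features, which is the sense in which temperature acts as a threshold in this sketch.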

RESULTS

On simulated data with main effects and statistical interactions, we compare accuracies on holdout and validation sets for three privacy-preserving methods: the reusable holdout (thresholdout), thresholdout with random forest, and private Evaporative Cooling, which uses Relief-F feature selection and random forest classification. In simulations where interactions exist between attributes, private Evaporative Cooling provides higher classification accuracy without overfitting, as assessed on an independent validation set. In simulations without interactions, thresholdout with random forest and private Evaporative Cooling give comparable accuracies. We also apply these privacy methods to human brain resting-state fMRI data from a study of major depressive disorder.
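For readers unfamiliar with the thresholdout mechanism compared above, the following Python sketch shows its core decision in simplified form. It is an illustration under stated assumptions, not the code used in the study: the function names, the default `threshold` and `sigma` values, and the use of a single noisy comparison per query are all hypothetical simplifications, and `sigma` must be positive.

```python
import random


def laplace(rng, scale):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def thresholdout(train_value, holdout_value, threshold=0.04, sigma=0.01, rng=None):
    """Answer one query (for example, an accuracy) without overusing the holdout.

    If the training and holdout estimates agree within a noisy threshold,
    report the training estimate, so the holdout reveals almost nothing.
    Otherwise report the holdout estimate with added noise.
    """
    rng = rng or random.Random()
    if abs(train_value - holdout_value) < threshold + laplace(rng, sigma):
        return train_value
    return holdout_value + laplace(rng, sigma)


if __name__ == "__main__":
    rng = random.Random(0)
    print(thresholdout(0.80, 0.79, rng=rng))  # estimates agree
    print(thresholdout(0.90, 0.60, rng=rng))  # training estimate looks overfit
```

The design point is that the holdout value is only disclosed, and only in noisy form, when the training estimate disagrees with it by more than the threshold.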

AVAILABILITY AND IMPLEMENTATION

Code is available at http://insilico.utulsa.edu/software/privateEC.

CONTACT

brett-mckinney@utulsa.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
