Tomasoni Danilo, Lombardo Rosario, Lauria Mario
Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy.
Department of Economics, University of Verona, Verona, Italy.
Front Genet. 2024 Jan 29;15:1270387. doi: 10.3389/fgene.2024.1270387. eCollection 2024.
Preserving data privacy is an important concern in the research use of patient data. The DataSHIELD suite enables privacy-aware advanced statistical analysis in a federated setting. Despite its many applications, it has a few open practical issues: the complexity of hosting a federated infrastructure, the performance penalty imposed by the privacy-preserving constraints, and limited ease of use for non-technical users. In this work, we describe a case study in which we review different breast cancer classifiers and report our findings about the limits and advantages of such a non-disclosive suite of tools in a realistic setting. Five independent gene expression datasets of breast cancer survival were downloaded from Gene Expression Omnibus (GEO) and pooled together through the federated infrastructure. Three previously published and two newly proposed 5-year cancer-free survival risk score classifiers were trained in a federated environment, and an additional reference classifier was trained with unconstrained data access. The performance of these six classifiers was systematically evaluated, and the results show that i) the published classifiers do not generalize well when applied to patient cohorts that differ from those used to develop them; ii) among the methods we tried, classification using logistic regression performed best on average, closely followed by random forest; iii) the unconstrained version of the logistic regression classifier outperformed the federated version by 4 on average. Reproducibility of our experiments is ensured through the use of VisualSHIELD, an open-source tool that augments DataSHIELD with new functions, a standardized deployment procedure, and a simple graphical user interface.