Georgia Institute of Technology, Atlanta, Georgia, USA.
Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, and Emory University, Atlanta, Georgia, USA.
Hum Brain Mapp. 2022 May;43(7):2289-2310. doi: 10.1002/hbm.25788. Epub 2022 Mar 4.
Privacy concerns for rare disease data, institutional or IRB policies, and limited access to local computational or storage resources or download capabilities are among the reasons that may preclude analyses that pool data at a single site. A growing number of multisite projects and consortia have been formed to operate in a federated environment and conduct productive research under constraints of this kind. In this scenario, a quality control tool that visualizes decentralized data in its entirety via global aggregation of local computations is especially important, as it allows the screening of samples that cannot otherwise be jointly evaluated. To address this need, we present two algorithms: decentralized data stochastic neighbor embedding (dSNE) and its differentially private counterpart, DP-dSNE. We leverage publicly available datasets to simultaneously map data samples located at different sites according to their similarities. Even though the data never leave the individual sites, dSNE does not provide any formal privacy guarantees. To overcome this, we rely on differential privacy: a formal mathematical guarantee that protects individuals from being identified as contributors to a dataset. We implement DP-dSNE with AdaCliP, a recently proposed method that adds less noise to the gradients at each iteration. We introduce metrics for measuring embedding quality and use them to validate our algorithms against their centralized counterpart on two toy datasets. Our validation on six multisite neuroimaging datasets shows promising results for the quality control tasks of visualization and outlier detection, highlighting the potential of our private, decentralized visualization approach.
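To make the idea of adding noise to gradients concrete, the following is a minimal sketch of the standard clip-and-noise (Gaussian mechanism) step on which differentially private gradient methods are commonly built. It is not AdaCliP and not the authors' DP-dSNE implementation: it assumes a single fixed clipping threshold, and the names `clip_gradient`, `private_mean_gradient`, `clip_norm`, and `noise_multiplier` are hypothetical, chosen only for illustration.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm or norm == 0.0:
        return list(grad)
    scale = clip_norm / norm
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum them, add Gaussian noise, average.

    Assumption: a fixed clipping threshold (clip_norm) is used for every
    sample and every coordinate. The noise standard deviation is
    noise_multiplier * clip_norm, applied independently per coordinate.
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        clipped = clip_gradient(grad, clip_norm)
        for i in range(dim):
            total[i] += clipped[i]
    sigma = noise_multiplier * clip_norm
    return [(total[i] + rng.gauss(0.0, sigma)) / n for i in range(dim)]


# Usage: two hypothetical per-sample gradients of a 2-D embedding point.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 1.0]]
noisy_update = private_mean_gradient(grads, clip_norm=1.0,
                                     noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single sample can change the summed gradient, and the noise scale is tied to that bound. A larger `noise_multiplier` corresponds to stronger privacy but a noisier embedding, which is why methods that add less noise per iteration matter for visualization quality.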