Wu Samuel S, Chen Shigang, Bhattacharjee Abhishek, He Ying
Department of Biostatistics, University of Florida, Gainesville, FL 32610, USA.
Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32610, USA.
Third IEEE Int Conf Big Data Secur Cloud IEEE BigDataSecur 2017 (2017). 2017;2017:1-7. doi: 10.1109/bigdatasecurity.2017.10.
An integral part of any social or medical research is the availability of reliable data. To protect the integrity of participants' responses, sensitive data must be collected in a secure environment. This paper introduces a novel privacy-preserving data collection method: (CRM). The CRM method requires multiple masking service providers (MSPs), each generating its own random masking matrices. The key step is that each participant's data is randomly decomposed into a sum of component vectors, and each component vector is sent to the MSPs for masking in a different order. The CRM method publicly releases two sets of masked data: one right-multiplied by random invertible matrices and the other left-multiplied by random orthogonal matrices. Both the MSPs and the released data may be hosted on cloud platforms. The data collection and release procedure is designed so that neither the MSPs nor the data collector can derive the participants' original data, which provides strong privacy protection. Nevertheless, under commonly used statistical methods such as the general linear model, contingency table analysis, logistic regression, and Cox proportional hazards regression, statistical inference on the parameters of interest produces exactly the same results from the masked data as from the original data.
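Two properties behind the abstract's claims can be checked numerically. The following is a minimal sketch under stated assumptions, not the CRM protocol itself: it uses a single masking matrix rather than several MSPs, a Householder reflection stands in for a random orthogonal matrix, ordinary least squares stands in for the general linear model, and all function and variable names are hypothetical. It shows that (1) masking is linear, so the masked additive components of a record sum to the masked record, and (2) left-multiplying the design matrix and the response by the same orthogonal matrix leaves the least-squares estimates unchanged.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def split_additively(x, parts, rng):
    """Split x into `parts` random component vectors whose sum is x."""
    shares = [[rng.gauss(0.0, 10.0) for _ in x] for _ in range(parts - 1)]
    last = [xi - sum(s[i] for s in shares) for i, xi in enumerate(x)]
    return shares + [last]

def right_multiply(row, B):
    """Return the row vector `row` times the matrix B (list of rows)."""
    return [sum(row[i] * B[i][j] for i in range(len(row)))
            for j in range(len(B[0]))]

def householder_apply(v, x):
    """Return Q x for the orthogonal matrix Q = I - 2 v v^T / (v^T v)."""
    scale = 2.0 * dot(v, x) / dot(v, v)
    return [xi - scale * vi for vi, xi in zip(v, x)]

def ols_two_columns(a, b, y):
    """Least-squares coefficients for y ~ a, b via the 2x2 normal equations."""
    aa, ab, bb = dot(a, a), dot(a, b), dot(b, b)
    ay, by = dot(a, y), dot(b, y)
    det = aa * bb - ab * ab
    return ((bb * ay - ab * by) / det, (aa * by - ab * ay) / det)

rng = random.Random(0)

# (1) Linearity: mask each additive component, then sum the masked components.
record = [52.0, 1.0, 130.5]                      # one hypothetical participant
B = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(3)]  # random mask
shares = split_additively(record, 3, rng)
masked_shares = [right_multiply(s, B) for s in shares]
recombined = [sum(col) for col in zip(*masked_shares)]
masked_record = right_multiply(record, B)        # agrees with `recombined`

# (2) Orthogonal invariance: fit on (X, y) and on (QX, Qy).
n = 8
ones = [1.0] * n
x = [float(i) for i in range(n)]
y = [1.5 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
v = [rng.gauss(0.0, 1.0) for _ in range(n)]      # defines the reflection Q

beta_orig = ols_two_columns(ones, x, y)
beta_masked = ols_two_columns(householder_apply(v, ones),
                              householder_apply(v, x),
                              householder_apply(v, y))
```

Up to floating-point rounding, `recombined` equals `masked_record` and `beta_masked` equals `beta_orig`. Note that the masked intercept column is no longer a column of ones, so the masked fit must use the masked columns as given rather than adding a fresh intercept.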