Wieland Shannon C, Cassa Christopher A, Mandl Kenneth D, Berger Bonnie
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139-4307, USA.
Proc Natl Acad Sci U S A. 2008 Nov 18;105(46):17608-13. doi: 10.1073/pnas.0801021105. Epub 2008 Nov 17.
Datasets describing the health status of individuals are important for medical research but must be used cautiously to protect patient privacy. For patient data containing geographical identifiers, the conventional solution is to aggregate the data over large areas. This method often preserves privacy but suffers from substantial information loss, which degrades the quality of subsequent disease mapping or cluster detection studies. Other heuristic methods for de-identifying spatial patient information do not quantify the risk to individual privacy. We develop an optimal method, based on linear programming, that adds noise to individual locations while preserving the distribution of a disease. The method ensures a small, quantified risk of individual re-identification. Because the amount of noise added is minimal for the desired degree of privacy protection, the de-identified set is ideal for spatial epidemiological studies. We apply the method to patients in New York County, New York, showing that privacy is guaranteed while patients are moved 25 to 150 times less far than under aggregation by zip code.
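As a rough illustration of how a noise-adding scheme of this kind can be posed as a linear program, the toy sketch below may help. It is not the authors' formulation. The four-cell one-dimensional grid, the population and case counts, the specific privacy constraint, the threshold `eps`, and the function name `solve_toy_noise_lp` are all assumptions made for illustration. The unknowns are transition probabilities `T[i][j]` (the chance that a patient living in cell `i` is reported in cell `j`), the objective is the expected distance patients are moved, and the assumed privacy constraint caps at `eps` the probability that any specific resident of cell `i` is the patient behind a report in cell `j`, for an adversary who knows only the population counts.

```python
import numpy as np
from scipy.optimize import linprog


def solve_toy_noise_lp(positions, population, cases, eps):
    """Toy LP (illustrative assumption, not the published method).

    Minimizes the case-weighted expected displacement subject to
    T[i, j] <= eps * sum_k population[k] * T[k, j] for every pair (i, j).
    """
    pos = np.asarray(positions, dtype=float)
    pop = np.asarray(population, dtype=float)
    weights = np.asarray(cases, dtype=float)
    m = len(pos)

    # Distance between cell centers; variable T[i, j] sits at index i*m + j.
    dist = np.abs(pos[:, None] - pos[None, :])
    cost = (weights[:, None] * dist).ravel()

    # Each row of T is a probability distribution over reported cells.
    A_eq = np.zeros((m, m * m))
    for i in range(m):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    b_eq = np.ones(m)

    # Privacy rows: T[i, j] - eps * sum_k pop[k] * T[k, j] <= 0.
    A_ub = np.zeros((m * m, m * m))
    for i in range(m):
        for j in range(m):
            row = i * m + j
            for k in range(m):
                A_ub[row, k * m + j] -= eps * pop[k]
            A_ub[row, i * m + j] += 1.0
    b_ub = np.zeros(m * m)

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    if not res.success:
        raise ValueError(res.message)
    return res.x.reshape(m, m), res.fun


# Hypothetical inputs: cell centers, residents per cell, patients per cell.
positions = [0.0, 1.0, 2.0, 3.0]
population = [50, 200, 30, 120]
cases = [2, 5, 1, 3]
eps = 0.01  # at most a 1% chance of pinpointing any one resident

T, expected_move = solve_toy_noise_lp(positions, population, cases, eps)
print(np.round(T, 3))
print("expected total displacement:", round(expected_move, 3))
```

In this toy setup the problem is feasible whenever `eps` is at least one over the total population, because spreading every patient uniformly over all cells then satisfies the constraint. The solver's job is to find a feasible assignment that moves patients less than that uniform scheme, which mirrors the abstract's point that noise should be no larger than the chosen privacy level requires.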