Raisaro Jean Louis, Tramèr Florian, Ji Zhanglong, Bu Diyue, Zhao Yongan, Carey Knox, Lloyd David, Sofia Heidi, Baker Dixie, Flicek Paul, Shringarpure Suyash, Bustamante Carlos, Wang Shuang, Jiang Xiaoqian, Ohno-Machado Lucila, Tang Haixu, Wang XiaoFeng, Hubaux Jean-Pierre
School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
Health Science Department of Biomedical Informatics, University of California San Diego, San Diego, CA, USA.
J Am Med Inform Assoc. 2017 Jul 1;24(4):799-805. doi: 10.1093/jamia/ocw167.
The Global Alliance for Genomics and Health (GA4GH) created the Beacon Project as a means of testing the willingness of data holders to share genetic data in the simplest technical context: a query for the presence of a specified nucleotide at a given position within a chromosome. Each participating site (or "beacon") is responsible for assuring that genomic data are exposed through the Beacon service only with the permission of the individual to whom the data pertain and in accordance with the GA4GH policy and standards. While recognizing the inference risks associated with large-scale data aggregation, and the fact that some beacons contain sensitive phenotypic associations that increase privacy risk, the GA4GH adjudged the risk of re-identification based on the binary yes/no allele-presence query responses as acceptable. However, recent work demonstrated that, given a beacon with specific characteristics (including relatively small sample size and an adversary who possesses an individual's whole genome sequence), the individual's membership in a beacon can be inferred through repeated queries for variants present in the individual's genome. In this paper, we propose three practical strategies for reducing re-identification risks in beacons. The first two strategies manipulate the beacon such that the presence of rare alleles is obscured; the third strategy budgets the number of accesses per user for each individual genome. Using a beacon containing data from the 1000 Genomes Project, we demonstrate that the proposed strategies can effectively reduce re-identification risk in beacon-like datasets.
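As a rough illustration of the yes/no allele-presence query and of two of the mitigation ideas named in the abstract, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the mechanisms, so the class name `ToyBeacon`, the `min_carriers` threshold (one simple way to obscure rare alleles), and the `budget_per_genome` accounting (one simple way to limit per-user accesses to each individual genome) are all assumptions made for this example.

```python
from collections import defaultdict


class ToyBeacon:
    """Toy allele-presence beacon. All names and rules here are hypothetical."""

    def __init__(self, carriers, min_carriers=1, budget_per_genome=None):
        # carriers maps (chromosome, position, allele) -> set of individual ids
        self.carriers = carriers
        # Assumed rare-allele rule: answer "yes" only if at least this many
        # individuals carry the allele (1 means no masking).
        self.min_carriers = min_carriers
        # Assumed budget rule: per user, each individual genome may contribute
        # to at most this many "yes" answers (None means unlimited).
        self.budget_per_genome = budget_per_genome
        # spent[user][individual] counts "yes" answers that individual backed
        self.spent = defaultdict(lambda: defaultdict(int))

    def query(self, user, chrom, pos, allele):
        """Return True if the beacon reports the allele as present."""
        individuals = self.carriers.get((chrom, pos, allele), set())
        if self.budget_per_genome is not None:
            # Ignore genomes whose budget for this user is exhausted
            individuals = {
                i for i in individuals
                if self.spent[user][i] < self.budget_per_genome
            }
        if len(individuals) < max(1, self.min_carriers):
            return False
        for i in individuals:
            self.spent[user][i] += 1
        return True


if __name__ == "__main__":
    toy_data = {("1", 100, "A"): {"p1"}, ("1", 200, "T"): {"p1", "p2"}}
    beacon = ToyBeacon(toy_data, min_carriers=2)
    print(beacon.query("alice", "1", 100, "A"))  # allele carried by one person
    print(beacon.query("alice", "1", 200, "T"))  # allele carried by two people
```

In this sketch, masking and budgeting trade utility for privacy in different ways: the threshold hides rare alleles from every user, while the budget only limits how much any single user can learn about one genome through repeated queries.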