Wan Zhiyu, Vorobeychik Yevgeniy, Xia Weiyi, Clayton Ellen Wright, Kantarcioglu Murat, Malin Bradley
Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA.
Am J Hum Genet. 2017 Feb 2;100(2):316-322. doi: 10.1016/j.ajhg.2016.12.002. Epub 2017 Jan 5.
Emerging scientific endeavors are creating big data repositories that hold data from millions of individuals. Sharing data in a privacy-respecting manner could lead to important discoveries, but high-profile demonstrations show that links between de-identified genomic data and named persons can sometimes be reestablished. Such re-identification attacks have focused on worst-case scenarios and spurred the adoption of data-sharing practices that unnecessarily impede research. To mitigate concerns, organizations have traditionally relied upon legal deterrents, like data use agreements, and are considering suppressing or adding noise to genomic variants. In this report, we use a game-theoretic lens to develop more effective, quantifiable protections for genomic data sharing. This is a fundamentally different approach because it accounts for adversarial behavior and capabilities and tailors protections to anticipated recipients with reasonable resources, not adversaries with unlimited means. We demonstrate this approach via a new public resource with genomic summary data from over 8,000 individuals, the Sequence and Phenotype Integration Exchange (SPHINX), and show that risks can be balanced against utility more effectively than with traditional approaches. We further show the generalizability of this framework by applying it to other genomic data collection and sharing endeavors. Recognizing that such models depend on a variety of parameters, we perform extensive sensitivity analyses to show that our findings are robust to fluctuations in those parameters.