IEEE/ACM Trans Comput Biol Bioinform. 2018 Sep-Oct;15(5):1405-1412. doi: 10.1109/TCBB.2018.2859380. Epub 2018 Jul 24.
The dramatically decreasing costs of DNA sequencing have led more than a million people to have their genotypes sequenced. Moreover, these individuals increasingly make their genomic data publicly available, thereby creating privacy threats for themselves and for their relatives, because of the DNA they share. More generally, an entity that gains access to a significant fraction of sequenced genotypes might be able to infer even the genomes of unsequenced individuals. In this paper, we propose a simulation-based model for quantifying the impact of continuously sequencing and publicizing personal genomic data on a population's genomic privacy. Our simulation probabilistically models data sharing and takes into account events such as migration and interracial mating. As an example, we instantiate our simulation with a sample population of 1,000 individuals and evaluate privacy under multiple settings across 6,000 genomic variants and a subset of phenotype-related variants. Our findings demonstrate that an increasing sharing rate in the future has a substantial negative effect on the privacy of all older generations. Moreover, we find that mixed populations face a less severe erosion of privacy over time than more homogeneous populations. Finally, we demonstrate that genomic-data sharing can be much more detrimental to the privacy of the phenotype-related variants.