Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA.
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, 06520, USA.
Nat Commun. 2018 Jun 22;9(1):2453. doi: 10.1038/s41467-018-04875-5.
Functional genomics experiments, such as RNA-seq, provide non-individual-specific information about gene expression under different conditions, such as disease and normal. There is great interest in sharing these data. However, privacy concerns often preclude sharing the raw reads. To enable safe sharing, aggregated summaries such as read-depth signal profiles and gene expression levels are used instead. Projects such as GTEx and ENCODE share these summaries because they ostensibly do not leak much identifying information. Here, we attempt to quantify the validity of this assumption by measuring the leakage of genomic deletions from signal profiles. We present information-theoretic measures of the degree to which one can genotype these deletions. We then develop practical genotyping approaches and demonstrate how they can be used to identify an individual within a large cohort in the context of linking attacks. Finally, we present an anonymization method that removes much of the leakage from signal profiles.
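To make the idea of leakage from a signal profile concrete, the following is a minimal sketch of how a deletion might be genotyped from a per-base read-depth signal. It is not the method developed in the paper. It assumes that a heterozygous deletion roughly halves the depth inside the deleted interval relative to its flanks, and that a homozygous deletion removes it almost entirely. The function name `genotype_deletion`, the flank width, and the ratio thresholds are all hypothetical choices made for illustration.

```python
from statistics import mean


def genotype_deletion(signal, start, end, flank=50, het_max=0.75, hom_max=0.25):
    """Guess how many copies of a region are deleted (0, 1, or 2).

    signal  : per-base read depth (sequence of numbers)
    start   : first index of the candidate deletion (inclusive)
    end     : last index of the candidate deletion (exclusive)
    flank   : number of bases on each side used as the baseline (assumed value)
    het_max : depth ratio at or below which a heterozygous call is made (assumed)
    hom_max : depth ratio at or below which a homozygous call is made (assumed)
    """
    signal = list(signal)
    inside = signal[start:end]
    # Flanking windows approximate the depth expected without a deletion.
    flanks = signal[max(0, start - flank):start] + signal[end:end + flank]
    if not inside or not flanks:
        raise ValueError("deletion interval and flanks must be non-empty")

    baseline = mean(flanks)
    if baseline == 0:
        return None  # no coverage nearby, so no call can be made

    ratio = mean(inside) / baseline
    if ratio <= hom_max:
        return 2  # both copies appear deleted
    if ratio <= het_max:
        return 1  # one copy appears deleted
    return 0      # depth is consistent with no deletion


# Toy profile: depth 20 everywhere except a 50-base dip to depth 10.
profile = [20] * 100 + [10] * 50 + [20] * 100
call = genotype_deletion(profile, 100, 150)
```

In real RNA-seq data, depth also varies with expression level and splicing, so a fixed-threshold rule like this one would be far noisier than on the toy profile. The point is only that a shared signal profile can carry enough structure to call a genotype.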