Harmanci Arif, Gerstein Mark
Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA.
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, USA.
Nat Methods. 2016 Mar;13(3):251-6. doi: 10.1038/nmeth.3746. Epub 2016 Feb 1.
Studies on genomic privacy have traditionally focused on identifying individuals using DNA variants. In contrast, molecular phenotype data, such as gene expression levels, are generally assumed to be free of such identifying information. Although there is no explicit genotypic information in phenotype data, adversaries can statistically link phenotypes to genotypes using publicly available genotype-phenotype correlations such as expression quantitative trait loci (eQTLs). This linking can be accurate when high-dimensional data (i.e., many expression levels) are used, and the resulting links can then reveal sensitive information (for example, the fact that an individual has cancer). Here we develop frameworks for quantifying the leakage of characterizing information from phenotype data sets. These frameworks can be used to estimate the leakage from large data sets before release. We also present a general three-step procedure for practically instantiating linking attacks and a specific attack using outlier gene expression levels that is simple yet accurate. Finally, we describe the effectiveness of this outlier attack under different scenarios.
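As an illustration of the linking idea described above, here is a minimal sketch on synthetic data. It is not the authors' implementation. The abstract does not spell out the three steps, so the sketch assumes they are (1) choosing eQTLs, (2) predicting genotypes from extreme expression values, and (3) matching the predictions against a genotype database. All function names, the threshold, the noise level, and the simulated effect sizes are hypothetical.

```python
import random


def simulate(n_people=50, n_genes=40, noise=0.3, seed=0):
    """Synthetic cohort (assumption): one eQTL SNP per gene, genotypes in {0, 1, 2}."""
    rng = random.Random(seed)
    # Step 1 stand-in: every gene is treated as a known eQTL with a known effect sign.
    signs = [rng.choice([-1, 1]) for _ in range(n_genes)]
    genotypes = [[rng.choice([0, 1, 2]) for _ in range(n_genes)]
                 for _ in range(n_people)]
    # Expression is centred on 0 and shifts by +/-1 per allele, plus Gaussian noise.
    expression = [[signs[g] * (genotypes[p][g] - 1) + rng.gauss(0, noise)
                   for g in range(n_genes)]
                  for p in range(n_people)]
    return signs, genotypes, expression


def predict_genotypes(expr_row, signs, threshold=0.7):
    """Step 2: guess a homozygous genotype only where expression is an outlier."""
    preds = []
    for value, sign in zip(expr_row, signs):
        if abs(value) < threshold:
            preds.append(None)  # not extreme enough, make no prediction
        else:
            high = value > 0
            # High expression with a positive effect (or low with a negative one)
            # suggests genotype 2; the opposite combination suggests genotype 0.
            preds.append(2 if high == (sign > 0) else 0)
    return preds


def link(preds, genotype_db):
    """Step 3: return the database index with the fewest mismatches at predicted sites."""
    def mismatches(row):
        return sum(1 for p, g in zip(preds, row) if p is not None and p != g)
    return min(range(len(genotype_db)), key=lambda i: mismatches(genotype_db[i]))


def linking_accuracy(n_people=50, n_genes=40, noise=0.3, seed=0):
    """Fraction of expression profiles linked back to the correct genotype record."""
    signs, genotypes, expression = simulate(n_people, n_genes, noise, seed)
    hits = sum(link(predict_genotypes(expression[p], signs), genotypes) == p
               for p in range(n_people))
    return hits / n_people


if __name__ == "__main__":
    print(linking_accuracy())
```

Raising `noise` or lowering `n_genes` in this toy setup weakens the link, which mirrors the abstract's point that linking can be accurate when many expression levels are available.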