Sei Yuichi, Ohsuga Akihiko
The University of Electro-Communications, Tokyo, Japan.
BioData Min. 2021 Jan 22;14(1):6. doi: 10.1186/s13040-021-00238-x.
The importance of privacy protection in analyses of personal data, such as genome-wide association studies (GWAS), has grown in recent years. GWAS focuses on identifying single-nucleotide polymorphisms (SNPs) associated with diseases such as cancer and diabetes, and the chi-squared (χ²) hypothesis test of independence can be used for this identification. However, recent studies have shown that publishing the results of χ² tests of SNPs or personal data can lead to privacy violations. Several studies have proposed anonymization methods for χ² testing with ε-differential privacy, which is the cryptographic community's de facto privacy metric. However, existing methods either apply only to 2×2 or 2×3 contingency tables or have low accuracy for small numbers of samples. In many cases, such as COVID-19 analysis in the early stage of its spread, it is difficult to collect large numbers of highly sensitive samples.
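As a minimal sketch of the quantity being protected, the following computes the standard (non-private) Pearson χ² statistic of independence for an I×J contingency table. It is not the RandChiDist method, which the abstract does not specify; the function names and the example counts are hypothetical and chosen only for illustration.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for an I x J contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells whose row or column is empty
                stat += (observed - expected) ** 2 / expected
    return stat


def degrees_of_freedom(table):
    """Degrees of freedom of the test: (I - 1) * (J - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical 2x3 table: rows = case/control, columns = three genotypes.
example = [[10, 20, 30],
           [20, 20, 20]]

if __name__ == "__main__":
    print(chi_squared_statistic(example), degrees_of_freedom(example))
```

The statistic is compared against a χ² distribution with (I−1)(J−1) degrees of freedom. A differentially private method would release a perturbed version of this value rather than the exact one.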
We propose a novel anonymization method (RandChiDist), which anonymizes χ² testing for small samples. We prove that RandChiDist satisfies differential privacy. We also evaluate it experimentally using synthetic datasets and two real genomic datasets. Among existing and baseline methods that can control the Type I error rate, RandChiDist achieved the fewest Type II errors.
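For reference, the standard definition of ε-differential privacy can be stated as follows. The notation here (mechanism $\mathcal{M}$, datasets $D$ and $D'$, output set $S$) is assumed for illustration and is not taken from the abstract, which also does not say which notion of neighboring datasets the authors use.

```latex
% A randomized mechanism \mathcal{M} satisfies \varepsilon-differential privacy
% if, for all neighboring datasets D and D' (differing in one individual's
% record) and every set S of possible outputs,
\[
  \Pr\bigl[\mathcal{M}(D) \in S\bigr]
  \;\le\;
  e^{\varepsilon}\,\Pr\bigl[\mathcal{M}(D') \in S\bigr].
\]
```

A smaller ε gives a stronger guarantee, since the output distributions on neighboring datasets must then be closer to each other.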
We propose a new differentially private method, named RandChiDist, for anonymizing χ² values for an I×J contingency table with a small number of samples. The experimental results show that RandChiDist outperforms existing methods for small numbers of samples.