School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA.
Microsoft Research, Redmond, WA 98052, USA.
BMC Med Genomics. 2020 Jul 21;13(Suppl 7):99. doi: 10.1186/s12920-020-0724-z.
Sharing biomedical data is crucial for enabling scientific discoveries across institutions and for improving health care. For example, genome-wide association studies (GWAS) based on a large number of samples can identify disease-causing genetic variants. Privacy concerns, however, have become a major hurdle for data management and utilization. Homomorphic encryption is one of the most powerful cryptographic primitives that can address these privacy and security issues. It supports computation on encrypted data, so data can be aggregated and arbitrary computations performed in an untrusted cloud environment without leaking sensitive information.
This paper presents a secure outsourcing solution to assess logistic regression models for quantitative traits to test their associations with genotypes. We adapt the semi-parallel training method by Sikorska et al., which builds a logistic regression model for covariates, followed by one-step parallelizable regressions on all individual single nucleotide polymorphisms (SNPs). In addition, we modify our underlying approximate homomorphic encryption scheme for performance improvement.
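To make the semi-parallel idea concrete, the following is a minimal plaintext sketch in NumPy, not the encrypted implementation described in this paper. It assumes the usual formulation of the method: fit the covariate-only logistic model once by Newton iterations, then obtain every SNP coefficient from a single weighted least-squares step after projecting out the covariates. The function names (`fit_covariates`, `semi_parallel_scan`), the iteration count, and the data layout (samples in rows, SNPs in columns) are illustrative assumptions.

```python
import numpy as np


def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))


def fit_covariates(X, y, iters=25):
    """Fit logistic regression of y on covariates X with Newton's method.

    X: (n, k) covariate matrix (include a column of ones for the intercept).
    y: (n,) binary phenotype coded as 0/1.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)                      # IRLS weights
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def semi_parallel_scan(X, S, y):
    """One-step SNP effect estimates for all SNPs at once.

    S: (n, m) genotype matrix, one column per SNP.
    Returns (b, z_stat): per-SNP coefficient estimates and z statistics.
    """
    beta = fit_covariates(X, y)
    p = sigmoid(X @ beta)
    w = p * (1.0 - p)
    z = X @ beta + (y - p) / w                 # IRLS working response

    # A = (X^T W X)^{-1} X^T W, used to project out the covariates.
    XtW = X.T * w
    A = np.linalg.solve(XtW @ X, XtW)
    S_star = S - X @ (A @ S)
    z_star = z - X @ (A @ z)

    # Every SNP reduces to a scalar weighted regression, so all m SNPs
    # are handled by the same vectorized expressions.
    num = S_star.T @ (w * z_star)
    den = np.sum(w[:, None] * S_star ** 2, axis=0)
    b = num / den
    z_stat = b * np.sqrt(den)                  # b / standard error
    return b, z_stat
```

Because the per-SNP step involves only inner products with shared weights, the SNPs can be processed independently, which is the property that makes the method parallelizable.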
We evaluated the performance of our solution through experiments on a real-world dataset. Among homomorphic encryption systems for GWAS analysis, it achieves the best performance in terms of both complexity and accuracy. For example, given a dataset of 245 samples, each with 10643 SNPs and 3 covariates, our algorithm takes about 43 seconds to perform logistic-regression-based genome-wide association analysis on encrypted data.
We demonstrate the feasibility and scalability of our solution.