Rutgers University, Newark, NJ, USA.
University of Texas Health Science Center at Houston, TX, USA.
J Biomed Inform. 2021 May;117:103714. doi: 10.1016/j.jbi.2021.103714. Epub 2021 Mar 10.
As cloud computing is increasingly adopted for genome-wide association studies (GWAS), verifying the integrity of outsourced GWAS computation remains an open problem. Here, we propose two novel algorithms to generate synthetic SNPs that are indistinguishable from real SNPs. The first method creates synthetic SNPs based on the phenotype vector, while the second creates synthetic SNPs based on the real SNPs that are most similar to the phenotype vector. The time complexities of the first and second approaches are O(m) and O(m log n), respectively, where m is the number of subjects and n is the number of SNPs. Furthermore, through a game-theoretic analysis, we demonstrate that the server can be incentivized to behave honestly by coupling appropriate payoffs with randomized verification. We conduct extensive experiments on the proposed methods. The results show that, beyond the formal adversarial model, when only a few synthetic SNPs are generated and mixed into the real data, they cannot be distinguished from the real SNPs even by a variety of predictive machine learning models. We demonstrate that the proposed approach allows logistic regression for GWAS to be outsourced in an efficient and trustworthy way.
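The incentive argument in the abstract can be illustrated with a minimal sketch. It assumes a simple payoff model that is not taken from the paper: a cheating server gains `gain` if it is not checked and pays `penalty` if it is, and a server that skips some SNPs is caught whenever a planted synthetic SNP falls among the skipped ones. The function names and parameters are hypothetical.

```python
from math import comb


def min_verification_probability(gain: float, penalty: float) -> float:
    """Smallest verification probability q that makes cheating unprofitable.

    Assumed model (not from the paper): cheating yields `gain` when the
    result is not verified and costs `penalty` when it is, so the expected
    payoff of cheating is (1 - q) * gain - q * penalty. Setting this to
    zero gives q = gain / (gain + penalty).
    """
    return gain / (gain + penalty)


def detection_probability(n_total: int, n_synthetic: int, n_skipped: int) -> float:
    """Chance that at least one synthetic SNP is among the skipped SNPs.

    Assumes the server skips `n_skipped` of `n_total` SNPs uniformly at
    random because it cannot tell synthetic SNPs from real ones.
    """
    # Probability that every skipped SNP is a real one (hypergeometric).
    all_real = comb(n_total - n_synthetic, n_skipped) / comb(n_total, n_skipped)
    return 1.0 - all_real


if __name__ == "__main__":
    print(min_verification_probability(gain=1.0, penalty=9.0))
    print(detection_probability(n_total=10_000, n_synthetic=20, n_skipped=500))
```

Under these assumptions, a larger penalty lowers the verification rate needed to deter cheating, and even a small number of indistinguishable synthetic SNPs makes skipping a sizable share of the computation risky for the server.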