Computer and Data Sciences, Case Western Reserve University, OH 44106, United States.
Research, IBM, D15HN66, Ireland.
Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i168-i176. doi: 10.1093/bioinformatics/btad274.
Rapid improvements in genomic sequencing technology have led to a proliferation of locally collected genomic datasets. Given the sensitivity of genomic data, it is crucial to conduct collaborative studies while preserving the privacy of the individuals involved. Before any collaborative research effort begins, however, the quality of the data must be assessed. An essential step of the quality control process is population stratification: identifying genetic differences among individuals that are due to subpopulations. A common method for grouping individuals' genomes by ancestry is principal component analysis (PCA). In this article, we propose a privacy-preserving framework that uses PCA to assign individuals to populations across multiple collaborators as part of the population stratification step. In our proposed client-server scheme, the server first trains a global PCA model on a publicly available genomic dataset that contains individuals from multiple populations. Each collaborator (client) then uses the global PCA model to reduce the dimensionality of its local data. After adding noise to achieve local differential privacy (LDP), the collaborators send metadata about their research datasets (in the form of their local PCA outputs) to the server, which aligns the local PCA results to identify the genetic differences among the collaborators' datasets. Our results on real genomic data show that the proposed framework can perform population stratification analysis with high accuracy while preserving the privacy of the research participants.
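The client-server scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the noise mechanism, the clipping rule, or the alignment procedure, so everything below is an assumption made for illustration: the function names are hypothetical, the Laplace mechanism with L1 clipping stands in for the paper's LDP step, nearest-centroid assignment stands in for the server-side alignment, and the genotype data are synthetic.

```python
import numpy as np

def fit_global_pca(reference, k):
    """Server: fit PCA on a public reference panel (rows = individuals)."""
    mean = reference.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:k]

def project(genotypes, mean, components):
    """Client: reduce local genotypes to k dimensions with the global model."""
    return (genotypes - mean) @ components.T

def privatize(scores, clip, epsilon, rng):
    """Client: clip each row to L1 norm <= clip, then add Laplace noise.

    Assumed mechanism (not taken from the paper): any two clipped rows
    differ by at most 2 * clip in L1 norm, so Laplace noise with scale
    2 * clip / epsilon per coordinate gives epsilon-LDP for each row.
    """
    norms = np.abs(scores).sum(axis=1, keepdims=True)
    factor = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    clipped = scores * factor
    noise = rng.laplace(0.0, 2.0 * clip / epsilon, size=clipped.shape)
    return clipped + noise

def assign_populations(noisy_scores, centroids):
    """Server: label each noisy row with the nearest reference centroid."""
    diffs = noisy_scores[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_snps = 200
    # Synthetic allele frequencies for two hypothetical populations.
    freq_a = rng.uniform(0.1, 0.9, size=n_snps)
    freq_b = rng.uniform(0.1, 0.9, size=n_snps)
    # Public reference panel: 50 individuals per population, genotypes in {0, 1, 2}.
    reference = np.vstack([
        rng.binomial(2, freq_a, size=(50, n_snps)),
        rng.binomial(2, freq_b, size=(50, n_snps)),
    ]).astype(float)
    labels = np.repeat([0, 1], 50)

    mean, components = fit_global_pca(reference, k=2)
    ref_scores = project(reference, mean, components)
    centroids = np.array([ref_scores[labels == p].mean(axis=0) for p in (0, 1)])
    # The clipping bound comes from public data only, so it leaks nothing private.
    clip = np.abs(ref_scores).sum(axis=1).max()

    # One client holding 20 private individuals drawn from population 0.
    local = rng.binomial(2, freq_a, size=(20, n_snps)).astype(float)
    noisy = privatize(project(local, mean, components), clip, epsilon=5.0, rng=rng)
    print(assign_populations(noisy, centroids))
```

In this sketch the clipping bound is derived from the public reference panel rather than from any client's data, so choosing it does not consume privacy budget. Smaller values of `epsilon` add more noise and can be expected to reduce assignment accuracy.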