Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.
Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts 02142, USA.
Genome Res. 2024 Oct 11;34(9):1312-1323. doi: 10.1101/gr.279057.124.
Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging owing to the burden of estimating kinship between all pairs of individuals across data sets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing (LSH) approach. We assign individuals who are likely to be related to the same buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enables accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) that allows data holders to cooperatively compute the relatedness coefficients between individuals and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and other data sets. On a data set of 200,000 individuals split between two parties, SF-Relate detects 97% of third-degree or closer relatives within 15 h of runtime. Our work enables secure identification of relatives across large-scale genomic data sets.
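The bucketing idea described above can be illustrated with a toy sketch. This is not SF-Relate's hash function or protocol: the window-based keys, the function names (`segment_keys`, `build_buckets`, `candidate_pairs`), and the 0/1 haplotype encoding are all assumptions made for illustration. The sketch also compares bucket keys in plaintext, whereas the abstract describes comparisons carried out under multiparty homomorphic encryption.

```python
import hashlib
from collections import defaultdict
from itertools import product


def segment_keys(haplotype, window):
    """Yield one (window_index, digest) key per non-overlapping window.

    Two individuals who share an identical segment at the same position
    (a crude stand-in for an IBD segment) produce the same key.
    """
    for start in range(0, len(haplotype) - window + 1, window):
        segment = haplotype[start:start + window]
        digest = hashlib.sha256(bytes(segment)).hexdigest()[:16]
        yield (start // window, digest)


def build_buckets(cohort, window):
    """Map each key to the set of sample IDs that produced it (one party's view)."""
    buckets = defaultdict(set)
    for sample_id, haplotype in cohort.items():
        for key in segment_keys(haplotype, window):
            buckets[key].add(sample_id)
    return buckets


def candidate_pairs(buckets_a, buckets_b):
    """Return cross-party pairs that fall into a matching bucket."""
    pairs = set()
    for key in buckets_a.keys() & buckets_b.keys():
        pairs.update(product(buckets_a[key], buckets_b[key]))
    return pairs


# Hypothetical toy data: alleles encoded as 0/1 at eight sites per individual.
party_a = {"a1": [0, 1, 1, 0, 1, 0, 0, 1], "a2": [1, 1, 1, 1, 0, 0, 0, 0]}
party_b = {"b1": [0, 1, 1, 0, 0, 1, 1, 0], "b2": [1, 0, 0, 1, 1, 1, 0, 1]}

pairs = candidate_pairs(build_buckets(party_a, 4), build_buckets(party_b, 4))
print(pairs)
```

In this toy example only `a1` and `b1` share a segment at the same position, so one candidate pair is kept out of the four possible cross-party pairs. Only such candidate pairs would then go on to the more expensive kinship estimation step.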