Woods Andrew, Kramer Skyler T, Xu Dong, Jiang Wei
Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States.
Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States.
JMIR Bioinform Biotechnol. 2023 Jul 18;4:e44700. doi: 10.2196/44700.
While genomic variations can provide valuable information for health care and ancestry, the privacy of individual genomic data must be protected. A human DNA database therefore needs a secure environment in which the full dataset is queryable but not directly accessible to the parties involved (eg, data hosts and hospitals), and in which query results are learned only by the user or an authorized party.
In this study, we provide efficient and secure computation of the following set operations on panels of single nucleotide polymorphisms (SNPs) from genomic sequences: union, intersection, set difference, and symmetric difference.
Using these operations, we can compute similarity metrics, such as the Jaccard similarity, which could allow a DNA database to be queried securely to find the same person and genetic relatives. We analyzed various security paradigms and report metrics for the protocols under several security assumptions: semihonest, malicious with an honest majority, and malicious with a malicious majority.
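As a point of reference, the four set operations and the Jaccard similarity can be sketched in plaintext Python. This is only an illustration of what is being computed, not the secure protocols evaluated in the study: nothing here is encrypted or secret-shared, the SNP identifiers are hypothetical, and the function name `jaccard_similarity` and the convention of returning 1.0 for two empty panels are assumptions made for this example.

```python
from __future__ import annotations


def jaccard_similarity(panel_a: set[str], panel_b: set[str]) -> float:
    """Return |A ∩ B| / |A ∪ B| for two SNP panels (plaintext, not secure)."""
    union = panel_a | panel_b
    if not union:
        # Assumed convention: two empty panels are treated as identical.
        return 1.0
    return len(panel_a & panel_b) / len(union)


# Hypothetical SNP identifiers, for illustration only.
panel_a = {"rs0001", "rs0002", "rs0003", "rs0004"}
panel_b = {"rs0003", "rs0004", "rs0005"}

union = panel_a | panel_b                 # SNPs in either panel
intersection = panel_a & panel_b          # SNPs in both panels
difference = panel_a - panel_b            # SNPs only in panel_a
symmetric_difference = panel_a ^ panel_b  # SNPs in exactly one panel

similarity = jaccard_similarity(panel_a, panel_b)
print(f"Jaccard similarity: {similarity:.2f}")
```

A similarity close to 1 would suggest the same individual, while intermediate values could indicate relatedness; any threshold for such a decision is outside the scope of this sketch.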
We show that our methods can be used practically on realistically sized data. Specifically, we can compute the Jaccard similarity of two genomes, each represented as a set of 400,000 SNPs, in 2.16 seconds under the assumption of a malicious adversary with an honest majority and in 0.36 seconds under a semihonest model.
Our methods may facilitate the adoption of trusted environments for hosting individual genomic data with end-to-end data security.