Medical Data Privacy and Privacy-Preserving ML on Healthcare Data, Department of Computer Science, University of Tübingen, Tübingen, Germany.
Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany.
Bioinformatics. 2022 Apr 12;38(8):2202-2210. doi: 10.1093/bioinformatics/btac070.
Diagnosis and treatment decisions based on genomic data have become widespread as the cost of genome sequencing steadily decreases. In this context, disease-gene association studies are of great importance. However, genomic data are highly sensitive compared to other data types and contain information about individuals and their relatives. Many studies have shown that this information can be inferred from query-response pairs on genomic databases. In this work, we propose a method that uses secure multi-party computation to query genomic databases in a privacy-preserving manner. The proposed solution privately outsources genomic data from arbitrarily many sources to two non-colluding proxies and allows genomic databases to be stored safely in semi-honest cloud environments. It provides data privacy, query privacy and output privacy by using XOR-based sharing and, unlike previous solutions, it allows queries to run efficiently on hundreds of thousands of genomes.
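To make the idea of XOR-based sharing between two non-colluding proxies concrete, the following is a minimal Python sketch. It is not the protocol from this work: the function names (`share`, `reconstruct`, `xor_shared`) and the 8-genome presence vector are hypothetical and chosen only for illustration. It shows that each share on its own is a uniformly random bit string, and that XOR of two shared values can be computed locally by each proxy. Non-linear operations such as AND need interaction between the proxies and are not shown here.

```python
import secrets
from typing import Tuple

Shares = Tuple[int, int]  # (share held by proxy 0, share held by proxy 1)


def share(value: int, n_bits: int) -> Shares:
    """Split an n_bits-wide bit vector into two XOR shares."""
    mask = secrets.randbits(n_bits)  # uniformly random, reveals nothing alone
    return mask, value ^ mask


def reconstruct(shares: Shares) -> int:
    """Recombine both shares; only possible if the proxies cooperate."""
    return shares[0] ^ shares[1]


def xor_shared(a: Shares, b: Shares) -> Shares:
    """XOR two shared values; each proxy works only on its own shares."""
    return a[0] ^ b[0], a[1] ^ b[1]


# Hypothetical example: bit i is 1 if genome i carries the variant.
N_GENOMES = 8
variant_a = 0b10110010
variant_b = 0b01110001

shared_a = share(variant_a, N_GENOMES)
shared_b = share(variant_b, N_GENOMES)

# Genomes that carry exactly one of the two variants, computed on shares.
shared_diff = xor_shared(shared_a, shared_b)
result = reconstruct(shared_diff)
```

Reconstructing `shared_diff` yields the same bit vector as `variant_a ^ variant_b`, although neither proxy ever sees either input in the clear.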
We measure the performance of our solution with parameters similar to those of real-world applications. A genomic database with 3 000 000 variants can be queried with five genomic query predicates in under 400 ms. Querying 1 048 576 genomes, each containing 1 000 000 variants, for the presence of five different query variants can be achieved in approximately 6 min with a small amount of dedicated hardware and connectivity. These execution times are in the right range to enable real-world applications in medical research and healthcare. Unlike in previous studies, multiple databases can be queried with response times fast enough for practical application. To the best of our knowledge, this is the first solution that provides this performance for querying large-scale genomic data.
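As a rough sense of scale for the larger experiment, the short Python calculation below works out the size of the genome-by-variant matrix from the figures stated above. The one-bit-per-entry encoding is an assumption made only for this back-of-the-envelope estimate; the abstract does not state how entries are encoded, and the variable names are illustrative.

```python
# Figures taken from the text above.
GENOMES = 1_048_576        # equals 2**20
VARIANTS = 1_000_000       # variants per genome
SECONDS = 6 * 60           # "approximately 6 min"

# Total number of genome/variant entries in the queried matrix.
entries = GENOMES * VARIANTS

# ASSUMPTION: one bit per entry (variant present or absent).
bytes_at_one_bit = entries // 8
gigabytes_at_one_bit = bytes_at_one_bit / 1e9

# Average number of entries covered per second over the whole query.
entries_per_second = entries / SECONDS

print(f"{entries:,} entries")
print(f"{gigabytes_at_one_bit:.3f} GB at one bit per entry (assumed)")
print(f"{entries_per_second:,.0f} entries per second")
```

Under that assumption, the matrix holds roughly 1.05 trillion entries, or about 131 GB, which corresponds to an average of about 2.9 billion entries per second over the 6 min run.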
https://gitlab.com/DIFUTURE/privacy-preserving-variant-queries.
Supplementary data are available at Bioinformatics online.