Öksüz Abdullah Çağlar, Ayday Erman, Güdükbay Uğur
Department of Computer Engineering, Bilkent University, Ankara, Turkey.
Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA.
Bioinformatics. 2021 Sep 9;37(17):2668-2674. doi: 10.1093/bioinformatics/btab128.
Genome data has been a subject of study for both biology and computer science since the start of the Human Genome Project in 1990. Since then, genome sequencing for medical and social purposes has become increasingly available and affordable. Genome data can be shared on public websites or with service providers (SPs). However, this sharing compromises the privacy of donors, even when only part of the genome is shared. We mainly focus on the liability aspect arising from the unauthorized sharing of genome data. One technique for addressing liability issues in data sharing is watermarking.
To detect malicious correspondents and SPs, whose aim is to share genome data without individuals' consent and without being detected, we propose a novel watermarking method for sequential genome data using the belief propagation algorithm. Our method has two criteria to satisfy: (i) embedding robust watermarks, so that malicious adversaries cannot tamper with the watermark by modification and are identified with high probability; and (ii) achieving ϵ-local differential privacy in all data sharing with SPs. To preserve robustness against single-SP and collusion attacks, we consider publicly available genomic information such as minor allele frequency, linkage disequilibrium, phenotype information and familial information. Our proposed scheme achieves a 100% detection rate against single-SP attacks with only 3% watermark length. For the worst-case scenario of collusion attacks (50% of SPs are malicious), 80% detection is achieved with 5% watermark length and 90% detection with 10% watermark length. In all cases, the impact of ϵ on precision remained negligible and high privacy is ensured.
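As an illustration of criterion (ii), the sketch below shows generalized (k-ary) randomized response, a standard mechanism that satisfies ϵ-local differential privacy. It is not taken from the paper: the abstract does not say which mechanism the authors use, and the function names, the encoding of a SNP as a value in {0, 1, 2}, and the choice of mechanism are all assumptions made here for illustration.

```python
import math
import random


def keep_probability(epsilon, k=3):
    """Probability of reporting the true value under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def randomized_response(value, epsilon, domain=(0, 1, 2), rng=random):
    """Perturb one data point so that the output satisfies epsilon-local DP.

    Assumption (not from the paper): each SNP is encoded as the number of
    minor alleles, i.e. a value in {0, 1, 2}.
    """
    k = len(domain)
    if rng.random() < keep_probability(epsilon, k):
        return value  # report the true value
    # Otherwise report one of the other k - 1 values uniformly at random.
    return rng.choice([v for v in domain if v != value])


def perturb_sequence(snps, epsilon, rng=random):
    """Apply the mechanism independently to every position of a sequence."""
    return [randomized_response(s, epsilon, rng=rng) for s in snps]
```

Under this mechanism the true value is reported with probability e^ϵ / (e^ϵ + k − 1) and each other value with probability 1 / (e^ϵ + k − 1), so the ratio between any two output probabilities is at most e^ϵ, which is the ϵ-local differential privacy condition. A smaller ϵ gives more noise and stronger privacy; at ϵ = 0 the output is uniform over the domain.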
https://github.com/acoksuz/PPRW_SGD_BPLDP.
Supplementary data are available at Bioinformatics online.