Wan Shibiao, Wang Jieqiong
Center for Applied Bioinformatics, St. Jude Children's Research Hospital, Memphis, TN, United States.
Department of Radiology, University of Pennsylvania, Philadelphia, PA, United States.
Front Genet. 2022 Apr 13;13:876686. doi: 10.3389/fgene.2022.876686. eCollection 2022.
With the technological advances of recent decades, determining the whole genome sequence of a person has become feasible and affordable. As a result, large numbers of individual genomic sequences are being produced and collected for genetic medical diagnosis and cancer drug discovery, which at the same time poses serious challenges to the protection of personal genomic privacy. There is an urgent need for methods that keep personal genomic data both usable and confidential. Existing genomic privacy-protection methods either are time-consuming to encrypt or have low data-recovery accuracy. To tackle these problems, this paper proposes a sequence similarity-based obfuscation method, namely IterMegaBLAST, for fast and reliable protection of personal genomic privacy. Specifically, given a randomly selected sequence from a dataset of genomic sequences, we first use MegaBLAST to find its most similar sequence in the dataset. These two aligned sequences form a cluster, for which an obfuscated sequence is generated by a DNA generalization lattice scheme. These procedures are performed iteratively until all of the sequences in the dataset are clustered and their obfuscated sequences are generated. Experimental results on benchmark datasets demonstrate that, under the same degree of anonymity, IterMegaBLAST significantly outperforms existing state-of-the-art approaches in terms of both utility accuracy and time complexity.
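The iterative pair-and-generalize procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several parts are assumptions made for illustration: `similarity` is a toy position-wise identity score standing in for a MegaBLAST search, the sequences are assumed to be already aligned and of equal length, `generalize` uses IUPAC two-base ambiguity codes as a stand-in for the DNA generalization lattice (whose exact form the abstract does not give), and masking a leftover unpaired sequence with `N` is a hypothetical choice. All function names are hypothetical.

```python
import random

# IUPAC ambiguity codes for pairs of distinct bases.
# Assumption: used here as a simple stand-in for the DNA generalization lattice.
IUPAC_PAIR = {
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def similarity(a, b):
    """Fraction of identical positions (toy stand-in for a MegaBLAST score)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def generalize(a, b):
    """Merge two aligned, equal-length sequences position by position."""
    merged = []
    for x, y in zip(a, b):
        if x == y:
            merged.append(x)
        else:
            # Any pair without a two-base code (e.g. one involving a gap) becomes N.
            merged.append(IUPAC_PAIR.get(frozenset((x, y)), "N"))
    return "".join(merged)


def iter_pair_obfuscate(seqs, seed=0):
    """Pick a random sequence, pair it with its most similar remaining
    sequence, emit one generalized sequence for the pair, and repeat."""
    rng = random.Random(seed)
    remaining = list(range(len(seqs)))
    obfuscated = {}
    while len(remaining) >= 2:
        i = remaining.pop(rng.randrange(len(remaining)))
        j = max(remaining, key=lambda k: similarity(seqs[i], seqs[k]))
        remaining.remove(j)
        obfuscated[i] = obfuscated[j] = generalize(seqs[i], seqs[j])
    # Assumption: a leftover sequence (odd-sized dataset) is fully masked.
    for i in remaining:
        obfuscated[i] = "N" * len(seqs[i])
    return [obfuscated[i] for i in range(len(seqs))]


if __name__ == "__main__":
    data = ["ACGT", "ACGA", "TTTT", "TTTC"]
    for original, hidden in zip(data, iter_pair_obfuscate(data)):
        print(original, "->", hidden)
```

In this sketch each obfuscated sequence is shared by the two members of its cluster, so no record can be distinguished from its partner. In practice the similarity step would be replaced by an actual MegaBLAST query against the remaining sequences.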