Protecting genomic sequence anonymity with generalization lattices.

Author information

Malin B A

Affiliation

Carnegie Mellon University, School of Computer Science, Institute for Software Research International, Wean Hall Room 1320 B, Pittsburgh, PA 15213-3890, USA.

Publication information

Methods Inf Med. 2005;44(5):687-92.

Abstract

OBJECTIVES

Current genomic privacy technologies assume that the identity of genomic sequence data is protected if personal information, such as demographics, is obscured, removed, or encrypted. While demographic features can directly compromise an individual's identity, recent research demonstrates that such protections are insufficient because the sequence data itself is susceptible to re-identification. To counteract this problem, we introduce an algorithm for anonymizing a collection of person-specific DNA sequences.

METHODS

The technique, termed DNA lattice anonymization (DNALA), is based on the formal privacy protection schema of k-anonymity. Under this model, it is impossible to observe or learn features that distinguish one genetic sequence from k-1 other entries in a collection. To maximize the information retained in protected sequences, we incorporate a concept generalization lattice to learn the distance between two residues in a single nucleotide region. The lattice provides the most similar generalized concept for two residues (e.g., adenine and guanine are both purines).
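The lattice idea can be illustrated with a minimal sketch. The abstract does not specify the lattice's exact shape or the distance function, so this example assumes the standard IUPAC nucleotide ambiguity codes as the lattice and uses a hypothetical cost (the number of extra bases each residue must admit to reach the shared concept). The names `generalize`, `distance`, and `generalize_pair` are illustrative and are not taken from the paper.

```python
# Assumed lattice: IUPAC ambiguity codes, keyed by the set of bases each code covers.
# The lattice actually used by DNALA may differ; this is only an illustration.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R",  # purine
    frozenset("CT"): "Y",  # pyrimidine
    frozenset("AC"): "M", frozenset("GT"): "K",
    frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",  # any base (top of the lattice)
}
# Reverse lookup: code -> set of bases it covers.
BASES = {code: bases for bases, code in IUPAC.items()}


def generalize(x: str, y: str) -> str:
    """Return the most specific code that covers both residues (the lattice join)."""
    return IUPAC[BASES[x] | BASES[y]]


def distance(x: str, y: str) -> int:
    """Hypothetical cost: extra bases x and y must each admit to reach their join."""
    join = BASES[x] | BASES[y]
    return (len(join) - len(BASES[x])) + (len(join) - len(BASES[y]))


def generalize_pair(s: str, t: str) -> str:
    """Generalize two aligned, equal-length sequences position by position."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(generalize(a, b) for a, b in zip(s, t))


if __name__ == "__main__":
    print(generalize("A", "G"))             # adenine and guanine share the purine code
    print(generalize_pair("ACGT", "GCTT"))  # one released sequence for both inputs
```

If both input sequences are released as the same generalized string, neither can be told apart from the other, which corresponds to the k = 2 case of the model described above.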

RESULTS

The method is tested and evaluated with several publicly available human population datasets ranging in size from 30 to 400 sequences. Our findings imply that the anonymization schema is feasible for protecting sequence privacy.

CONCLUSIONS

The DNALA method is the first computational disclosure control technique for general DNA sequences. Given the computational nature of the method, guarantees of anonymity can be formally proven. There is room for improvement and validation, though this research provides the groundwork from which future researchers can construct genomic anonymization schemas tailored to specific data-sharing scenarios.
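One way to picture a checkable anonymity guarantee is a post-hoc test on the released collection. The sketch below assumes a simplified reading of k-anonymity in which every released sequence string must occur at least k times; the function name `is_k_anonymous` is hypothetical, and the paper's formal proof is not reproduced here.

```python
from collections import Counter


def is_k_anonymous(released: list[str], k: int) -> bool:
    """Check that every released sequence is shared by at least k records.

    Simplifying assumption: two records are indistinguishable only if their
    released strings are identical.
    """
    counts = Counter(released)
    return all(n >= k for n in counts.values())


if __name__ == "__main__":
    # Two pairs of identical generalized sequences satisfy k = 2.
    print(is_k_anonymous(["RCKT", "RCKT", "ACGT", "ACGT"], 2))
    # A sequence that appears only once violates k = 2.
    print(is_k_anonymous(["RCKT", "RCKT", "ACGT"], 2))
```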

