Accurate filtering of privacy-sensitive information in raw genomic data.

Affiliations

SnT - Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg.

LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Portugal.

Publication information

J Biomed Inform. 2018 Jun;82:1-12. doi: 10.1016/j.jbi.2018.04.006. Epub 2018 Apr 13.

Abstract

Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if it is not protected to the highest standards. In this article, we take the position that post-alignment privacy is not enough and that data should be automatically protected as early as possible in the genomics workflow, ideally immediately after the data is produced. We show that a previous approach for filtering short reads cannot extend to long reads and present a novel filtering approach that classifies raw genomic data (i.e., data whose location and content are not yet determined) into privacy-sensitive (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows the fine-grained and automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied to reads of any length, making it usable with any recent or future sequencing technology. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides remain undetected per genome, instead of 100,000 in previous works). It has far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56% with 2% of mutations). Finally, practical experiments demonstrate high performance, both in terms of throughput and memory consumption.
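The abstract does not describe how the filter works internally. Purely as an illustration of what classifying the nucleotides of a raw read into sensitive and non-sensitive could look like, the sketch below assumes a hypothetical set of known sensitive k-mers and an exact-match lookup. The names `sensitive_mask` and `HYPOTHETICAL_SENSITIVE`, the value of `k`, and the sequences are all invented for this example and are not taken from the article; in particular, exact matching would not tolerate the sequencing errors the article says its filter handles.

```python
from typing import List, Set


def sensitive_mask(read: str, sensitive_kmers: Set[str], k: int) -> List[bool]:
    """Flag each nucleotide of `read` covered by a known sensitive k-mer.

    Works for reads of any length: a read shorter than k simply yields
    an all-False mask. This is an assumed, simplified scheme, not the
    article's algorithm.
    """
    mask = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in sensitive_kmers:
            # Mark every position covered by the matching window.
            for j in range(i, i + k):
                mask[j] = True
    return mask


# Hypothetical dictionary of sensitive sequences (illustrative values only).
HYPOTHETICAL_SENSITIVE = {"ACGT", "TTGA"}

read = "GGACGTCC"
mask = sensitive_mask(read, HYPOTHETICAL_SENSITIVE, k=4)

# Split the read: sensitive positions could be routed to stronger protection,
# while the rest is masked out with a placeholder character.
sensitive_part = "".join(b if m else "-" for b, m in zip(read, mask))
print(sensitive_part)
```

A per-nucleotide mask, rather than a single label per read, is what would let protective measures be adjusted at a fine granularity, as the abstract describes.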

