Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.
Stanford University School of Medicine, Department of Genetics, Stanford, CA 94305, USA.
Cell. 2020 Nov 12;183(4):905-917.e16. doi: 10.1016/j.cell.2020.09.036.
The generation of functional genomics datasets is surging, because they provide insight into gene regulation and organismal phenotypes (e.g., genes upregulated in cancer). The intent behind functional genomics experiments is not necessarily to study genetic variants, yet they pose privacy concerns due to their use of next-generation sequencing. Moreover, there is a great incentive to broadly share raw reads for better statistical power and general research reproducibility. Thus, we need new modes of sharing beyond traditional controlled-access models. Here, we develop a data-sanitization procedure allowing raw functional genomics reads to be shared while minimizing privacy leakage, enabling principled privacy-utility trade-offs. Our protocol works with traditional Illumina-based assays and newer technologies such as 10x single-cell RNA sequencing. It involves quantifying the privacy leakage in reads by statistically linking study participants to known individuals. We carried out these linkages using data from highly accurate reference genomes and more realistic environmental samples.