Huang Pengzhi, Charton François, Schmelzle Jan-Niklas M, Darnell Shelby S, Prins Pjotr, Garrison Erik, Suh G Edward
Cornell University.
FAIR, Meta.
bioRxiv. 2024 Sep 24:2024.09.18.612131. doi: 10.1101/2024.09.18.612131.
The public availability of genome datasets, such as the Human Genome Project (HGP), the 1000 Genomes Project, The Cancer Genome Atlas, and the International HapMap Project, has significantly advanced scientific research and medical understanding. Our goal is to share such genomic information for downstream analysis while protecting the privacy of individuals through Differential Privacy (DP). We introduce synthetic DNA data generation based on pangenomes in combination with pretrained language models (PTLMs), along with two novel tokenization schemes based on pangenome graphs that enhance the modeling of DNA. We evaluate these tokenization methods and compare them with classical single-nucleotide and k-mer tokenizations. We find that they compare favorably with k-mer tokenization schemes, indicating that our tokenization schemes boost the consistency of the model's performance through a longer effective context length (covering longer sequences with the same number of tokens). Additionally, we propose a method to utilize the pangenome graph while making it comply with DP privacy standards. We assess the effect of DP training on the quality of generated sequences and discuss the trade-offs between privacy and model accuracy. The source code for our work will be published under a free and open source license soon.
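The notion of effective context length can be illustrated with the two classical baselines named in the abstract. The following is a minimal sketch, not the authors' implementation: the function names (`tokenize_nucleotides`, `tokenize_kmers`, `bases_covered`) are hypothetical, the k-mers are assumed to be non-overlapping, and the pangenome-graph tokenizers proposed in the paper are not reproduced here.

```python
def tokenize_nucleotides(seq: str) -> list[str]:
    """Single-nucleotide tokenization: one token per base."""
    return list(seq)


def tokenize_kmers(seq: str, k: int) -> list[str]:
    """Non-overlapping k-mer tokenization (assumed variant).

    The final token may be shorter than k if len(seq) is not a multiple of k.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def bases_covered(tokens: list[str], token_budget: int) -> int:
    """Number of bases spanned by the first `token_budget` tokens."""
    return sum(len(tok) for tok in tokens[:token_budget])


if __name__ == "__main__":
    sequence = "ACGTACGTACGT"  # toy sequence, for illustration only
    budget = 4                 # hypothetical context window, in tokens

    single = tokenize_nucleotides(sequence)
    kmers = tokenize_kmers(sequence, k=3)

    print("single-nucleotide tokens:", single[:budget])
    print("3-mer tokens:", kmers[:budget])
    print("bases covered (single):", bases_covered(single, budget))
    print("bases covered (3-mer):", bases_covered(kmers, budget))
```

With the same token budget, the k-mer tokenizer spans k times as many bases as the single-nucleotide tokenizer. Tokens that cover longer stretches of sequence extend this further, which is the sense in which the abstract speaks of covering longer sequences with the same number of tokens.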