Pangenome-Informed Language Models for Privacy-Preserving Synthetic Genome Sequence Generation.

Authors

Huang Pengzhi, Charton François, Schmelzle Jan-Niklas M, Darnell Shelby S, Prins Pjotr, Garrison Erik, Suh G Edward

Affiliations

Cornell University.

FAIR, Meta.

Publication information

bioRxiv. 2024 Sep 24:2024.09.18.612131. doi: 10.1101/2024.09.18.612131.

DOI: 10.1101/2024.09.18.612131
PMID: 39386557
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11463672/
Abstract

The public availability of genome datasets, such as the Human Genome Project (HGP), the 1000 Genomes Project, The Cancer Genome Atlas, and the International HapMap Project, has significantly advanced scientific research and medical understanding. Our goal is to share such genomic information for downstream analysis while protecting the privacy of individuals through Differential Privacy (DP). We introduce synthetic DNA data generation based on pangenomes in combination with pretrained language models (PTLMs), together with two novel tokenization schemes based on pangenome graphs that enhance the modeling of DNA. We evaluated these tokenization methods and compared them with classical single-nucleotide and k-mer tokenizations. The comparison with k-mer tokenization schemes indicates that our tokenization schemes boost the model's performance consistency with a long effective context length (covering longer sequences with the same number of tokens). Additionally, we propose a method to utilize the pangenome graph and make it comply with DP privacy standards. We assess the effect of DP training on the quality of generated sequences and discuss the trade-offs between privacy and model accuracy. The source code for our work will be published under a free and open source license soon.
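The two classical baselines named in the abstract, and the notion of effective context length, can be illustrated with a minimal sketch. The abstract does not describe the pangenome-graph tokenizers, so they are not reproduced here. The function names, the choice of non-overlapping k-mers, and the example values are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative baselines only: single-nucleotide and k-mer tokenization.
# All names here are hypothetical; non-overlapping k-mers are an assumption.

def single_nucleotide_tokens(seq: str) -> list[str]:
    """One token per base."""
    return list(seq)


def kmer_tokens(seq: str, k: int) -> list[str]:
    """Split seq into non-overlapping k-mers; a shorter tail is kept as the last token."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def bases_covered(context_tokens: int, bases_per_token: int) -> int:
    """Number of bases a fixed token budget spans when every token covers the same number of bases."""
    return context_tokens * bases_per_token


seq = "ACGTACGTAC"
print(single_nucleotide_tokens(seq))  # 10 tokens, one per base
print(kmer_tokens(seq, 3))            # ['ACG', 'TAC', 'GTA', 'C']
print(bases_covered(512, 1))          # 512 bases with single-nucleotide tokens
print(bases_covered(512, 6))          # 3072 bases with 6-mer tokens
```

With the same token budget, a tokenizer whose tokens span more bases covers a longer stretch of sequence, which is the sense of "effective context length" used in the abstract.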
