Computing linkage disequilibrium aware genome embeddings using autoencoders.

Affiliations

Department of Econometrics and Operations Research, Tilburg University, Tilburg 5037AB, The Netherlands.

Department of Neurology, University Medical Center Utrecht, Utrecht 3584CX, The Netherlands.

Publication information

Bioinformatics. 2024 Jun 3;40(6). doi: 10.1093/bioinformatics/btae326.

DOI:10.1093/bioinformatics/btae326
PMID:38775680
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11208726/
Abstract

MOTIVATION

The completion of the genome paved the way for genome-wide association studies (GWAS), which have explained a proportion of heritability. GWAS are not well suited to detecting non-linear effects on disease risk, which may be hidden in non-additive interactions (epistasis). Alternative methods for epistasis detection, e.g. those using deep neural networks (DNNs), are under active development. However, DNNs are constrained by finite computational resources, which can be rapidly depleted as model complexity grows with the sheer size of the genome. Moreover, the curse of dimensionality makes it harder for DNNs to capture meaningful genetic patterns, which necessitates dimensionality reduction.

RESULTS

We propose a method to compress single nucleotide polymorphism (SNP) data while leveraging the linkage disequilibrium (LD) structure and preserving potential epistasis. The method clusters correlated SNPs into haplotype blocks and trains one autoencoder per block to learn a compressed representation of that block's genetic content. We provide an adjustable autoencoder design that accommodates diverse blocks and bypasses extensive hyperparameter tuning. We applied this method to genotyping data from Project MinE and achieved 99% average test reconstruction accuracy (i.e. minimal information loss) while compressing the input to nearly 10% of its original size. We demonstrate that haplotype-block-based autoencoders outperform linear principal component analysis (PCA) by approximately 3% in chromosome-wide accuracy of reconstructed variants. To the best of our knowledge, our approach is the first to simultaneously leverage haplotype structure and DNNs for dimensionality reduction of genetic data.
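The block-then-compress idea can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the toy data, the greedy adjacent-SNP r² rule in `partition_blocks`, the 0.5 threshold, and all function names are assumptions made for illustration. The per-block compressor here is linear PCA, the baseline the abstract compares against; in the proposed method a per-block autoencoder would take its place. Accuracy is measured on the same samples used to fit the PCA, so it is not a held-out test accuracy.

```python
import numpy as np


def simulate_genotypes(n_samples=200, block_sizes=(8, 12, 10), noise=0.03, seed=0):
    """Toy 0/1/2 genotype matrix: strong LD inside each block, none between blocks."""
    rng = np.random.default_rng(seed)
    columns = []
    for size in block_sizes:
        base = rng.integers(0, 3, size=n_samples)        # shared signal of the block
        block = np.repeat(base[:, None], size, axis=1)
        flip = rng.random(block.shape) < noise           # a few entries get a random genotype
        block[flip] = rng.integers(0, 3, size=int(flip.sum()))
        columns.append(block)
    return np.hstack(columns)


def partition_blocks(G, r2_threshold=0.5):
    """Greedy split (assumed rule): open a new block when adjacent-SNP r^2 < threshold."""
    r = np.corrcoef(G, rowvar=False)
    blocks, start = [], 0
    for j in range(1, G.shape[1]):
        if r[j - 1, j] ** 2 < r2_threshold:
            blocks.append((start, j))
            start = j
    blocks.append((start, G.shape[1]))
    return blocks


def pca_reconstruct(block, n_components):
    """Project a block onto its top principal components and map back to genotypes."""
    mean = block.mean(axis=0)
    centered = block - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    latent = centered @ components.T                     # compressed representation
    recon = latent @ components + mean
    return np.clip(np.rint(recon), 0, 2).astype(int), latent


def compress_by_block(G, blocks, ratio=0.1):
    """Compress each block to about `ratio` of its width; return accuracy and latent size."""
    correct, n_latent = 0, 0
    for start, end in blocks:
        block = G[:, start:end]
        k = max(1, int(round(ratio * block.shape[1])))
        recon, _ = pca_reconstruct(block, k)
        correct += int((recon == block).sum())
        n_latent += k
    return correct / G.size, n_latent


G = simulate_genotypes()
blocks = partition_blocks(G)
accuracy, n_latent = compress_by_block(G, blocks)
print(f"blocks: {blocks}")
print(f"latent dims: {n_latent} of {G.shape[1]} SNPs, reconstruction accuracy: {accuracy:.3f}")
```

Compressing per block keeps each model small and lets the latent size follow the local LD structure. A linear projection such as this one cannot represent non-linear structure within a block, which is the gap a non-linear autoencoder is meant to close.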

AVAILABILITY AND IMPLEMENTATION

Data are available for academic use through Project MinE at https://www.projectmine.com/research/data-sharing/, contingent upon terms and requirements specified by the source studies. Code is available at https://github.com/gizem-tas/haploblock-autoencoders.

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cc4d/11208726/e6c130e86fb6/btae326f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cc4d/11208726/f3f4920b2384/btae326f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cc4d/11208726/cea0d9110f84/btae326f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cc4d/11208726/c810453fc7c5/btae326f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cc4d/11208726/c2cb839e486b/btae326f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cc4d/11208726/0216cca176e9/btae326f6.jpg
