Anand Shankara, Kalesinskas Laurynas, Smail Craig, Tanigawa Yosuke
Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, U.S.A.
*These authors contributed equally to this work.
Pac Symp Biocomput. 2019;24:184-195.
Genetic variation in the human genome is linked to many disease phenotypes. While whole-genome sequencing and genome-wide association studies (GWAS) have uncovered a number of genotype-phenotype associations, their functional interpretation remains challenging because most single nucleotide polymorphisms (SNPs) fall in non-coding regions of the genome. Advances in chromatin immunoprecipitation sequencing (ChIP-seq) have made large-scale repositories of epigenetic data available, allowing investigation of the coordinated mechanisms of epigenetic markers and transcriptional regulation and their influence on biological function. To address the challenge of interpreting non-coding variants, we propose SNPs2ChIP, a method that infers their biological functions by applying unsupervised statistical learning to publicly available epigenetic datasets. We systematically characterized latent factors by applying singular value decomposition to ChIP-seq tracks of lymphoblastoid cell lines, and annotated the biological function of each latent factor using the genomic region enrichment analysis tool. Using these annotated latent factors as a reference, we built the SNPs2ChIP pipeline, which takes one or more genomic regions as input, identifies the relevant latent factors with quantitative scores, and returns them along with their inferred functions. As a case study, we focused on systemic lupus erythematosus and demonstrated our method's ability to infer relevant biological function. We systematically applied SNPs2ChIP to publicly available datasets, including known GWAS associations from the GWAS catalogue and ChIP-seq peaks from a previously published study. Our approach of leveraging latent patterns across genome-wide epigenetic datasets to infer biological function will advance understanding of the genetics of human diseases by accelerating the interpretation of the non-coding genome.
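The core idea of decomposing ChIP-seq tracks into latent factors and scoring a query region against them can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy signal matrix, the function names `latent_factors` and `factor_scores`, and the use of a squared-cosine score as the "quantitative score" are hypothetical and are not taken from the SNPs2ChIP implementation.

```python
import numpy as np


def latent_factors(signal, k):
    """Truncated SVD of a (regions x tracks) signal matrix.

    Returns the first k left singular vectors, singular values,
    and right singular vectors (as rows).
    """
    u, s, vt = np.linalg.svd(signal, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


def factor_scores(u, s, region_idx):
    """Relative contribution of each latent factor to the given regions.

    Assumed scoring scheme (squared cosine): square each region's
    coordinates in latent space and normalize so each row sums to 1.
    """
    coords = u[region_idx, :] * s          # region coordinates in latent space
    squared = coords ** 2
    return squared / squared.sum(axis=1, keepdims=True)


# Toy data (hypothetical): 6 genomic regions x 4 ChIP-seq tracks.
# Regions 0-2 carry signal in tracks 0-1; regions 3-5 in tracks 2-3.
signal = np.array([
    [5.0, 5.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 3.0, 3.0],
])

u, s, vt = latent_factors(signal, k=2)

# Score two query regions (row indices 0 and 4) against the latent factors.
scores = factor_scores(u, s, region_idx=[0, 4])
top_factor = scores.argmax(axis=1)
print(scores.round(3))
print(top_factor)
```

In this toy setting each query region is dominated by a single latent factor. In a real analysis the functional annotation of that factor (for example, from a region enrichment tool) would be reported alongside the score.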