Zeng Haoyang, Hashimoto Tatsunori, Kang Daniel D, Gifford David K
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA and Department of Stem Cell and Regenerative Biology, Harvard University and Harvard Medical School, Cambridge, MA 02138, USA.
Bioinformatics. 2016 Feb 15;32(4):490-6. doi: 10.1093/bioinformatics/btv565. Epub 2015 Oct 17.
The majority of disease-associated variants identified in genome-wide association studies reside in noncoding regions of the genome with regulatory roles. Thus, the ability to interpret the functional consequence of a variant is essential for identifying causal variants in the analysis of genome-wide association studies.
We present GERV (generative evaluation of regulatory variants), a novel computational method for predicting regulatory variants that affect transcription factor binding. GERV learns a k-mer-based generative model of transcription factor binding from ChIP-seq and DNase-seq data, and scores variants by computing the change in predicted ChIP-seq reads between the reference and alternate alleles. The k-mers learned by GERV capture more sequence determinants of transcription factor binding than a motif-based approach alone, including both a transcription factor's canonical motif and associated co-factor motifs. We show that GERV outperforms existing methods in predicting single-nucleotide polymorphisms associated with allele-specific binding. GERV correctly predicts a validated causal variant among linked single-nucleotide polymorphisms and prioritizes the variants previously reported to modulate the binding of FOXA1 in breast cancer cell lines. Thus, GERV provides a powerful approach for functionally annotating and prioritizing causal variants for experimental follow-up analysis.
The implementation of GERV and related data are available at http://gerv.csail.mit.edu/.
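The scoring idea described in the abstract (predict a binding signal from k-mers, then compare the reference and alternate alleles) can be illustrated with a deliberately simplified sketch. This is not the GERV implementation: the scalar per-k-mer weights, the log-linear form, the toy values, and the function names `kmers`, `predicted_signal`, and `variant_score` are all assumptions made for illustration, and the DNase-seq input mentioned in the abstract is omitted entirely.

```python
from math import exp


def kmers(seq, k):
    """Return every overlapping k-mer in seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def predicted_signal(seq, weights, k):
    """Toy log-linear signal: exp of the summed k-mer weights.

    `weights` is a hypothetical dict mapping k-mer -> learned effect;
    k-mers absent from the dict contribute nothing.
    """
    return exp(sum(weights.get(km, 0.0) for km in kmers(seq, k)))


def variant_score(ref_window, pos, alt_base, weights, k):
    """Change in predicted signal when ref_window[pos] becomes alt_base."""
    if not 0 <= pos < len(ref_window):
        raise ValueError("pos must index into ref_window")
    alt_window = ref_window[:pos] + alt_base + ref_window[pos + 1:]
    return (predicted_signal(alt_window, weights, k)
            - predicted_signal(ref_window, weights, k))


# Made-up weights for three 3-mers; real weights would be learned from data.
toy_weights = {"TGT": 0.8, "GTT": 0.5, "TTT": 0.3}
ref = "ACTGTTTAC"

# T -> C at index 4 removes all three weighted k-mers from the window.
score = variant_score(ref, 4, "C", toy_weights, 3)
print(f"toy variant score: {score:.3f}")
```

A negative score in this toy setting means the alternate allele is predicted to weaken the signal; ranking linked variants by the magnitude of such a score is one plausible way to prioritize them for follow-up.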