Ernst Jason, Kellis Manolis
[1] Department of Biological Chemistry, University of California, Los Angeles, California, USA. [2] Computer Science Department, University of California, Los Angeles, California, USA. [3] Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research at UCLA, Los Angeles, California, USA. [4] Jonsson Comprehensive Cancer Center, University of California, Los Angeles, California, USA. [5] Molecular Biology Institute, University of California, Los Angeles, California, USA.
[1] MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, USA. [2] Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.
Nat Biotechnol. 2015 Apr;33(4):364-76. doi: 10.1038/nbt.3157. Epub 2015 Feb 18.
With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.
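The core idea of the abstract, predicting an unobserved signal track from correlated tracks across marks and samples with an ensemble of regression trees, can be illustrated with a minimal sketch. This is not the ChromImpute implementation: the sample names, signal values, the two features, the tree depth and the bootstrap-averaging scheme are all assumptions chosen for illustration, and the trees are written in plain Python rather than taken from any library.

```python
import random
from statistics import mean

# Toy data (hypothetical): signal per genomic bin for each (sample, mark) pair.
TARGET_MARK, OTHER_MARK = "H3K27ac", "H3K4me1"
SAMPLES = ["s1", "s2", "s3"]
N_BINS = 8
signal = {
    ("s1", "H3K4me1"): [1.0, 1.2, 8.0, 7.5, 0.9, 1.1, 8.2, 1.0],
    ("s1", "H3K27ac"): [0.5, 0.6, 9.0, 8.5, 0.4, 0.5, 9.2, 0.6],
    ("s2", "H3K4me1"): [1.1, 0.9, 7.8, 8.1, 1.0, 1.2, 7.9, 0.8],
    ("s2", "H3K27ac"): [0.4, 0.5, 8.8, 9.1, 0.6, 0.4, 8.9, 0.5],
    ("s3", "H3K4me1"): [0.9, 1.0, 8.1, 7.7, 1.1, 1.0, 8.0, 1.2],
    # ("s3", "H3K27ac") is the unobserved track we want to impute.
}

def features(sample, i):
    """Assumed features for bin i: another mark in the same sample, and the
    target mark averaged over the other samples where it is observed."""
    cross = [signal[(s, TARGET_MARK)][i] for s in SAMPLES
             if s != sample and (s, TARGET_MARK) in signal]
    return [signal[(sample, OTHER_MARK)][i], mean(cross)]

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(X, y, depth=2):
    """Fit a small regression tree; a leaf is a float, a node is a tuple."""
    if depth == 0 or len(y) < 2:
        return mean(y)
    best = None
    for j in range(len(X[0])):
        vals = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive values.
        for t in [(a + b) / 2 for a, b in zip(vals, vals[1:])]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:  # no usable split: all feature values identical
        return mean(y)
    _, j, t, left, right = best
    return (j, t,
            fit_tree([X[i] for i in left], [y[i] for i in left], depth - 1),
            fit_tree([X[i] for i in right], [y[i] for i in right], depth - 1))

def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(tree, tuple):
        j, t, low, high = tree
        tree = low if row[j] <= t else high
    return tree

def fit_ensemble(X, y, n_trees=20, seed=0):
    """Train each tree on a bootstrap resample of the training bins."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]
        trees.append(fit_tree([X[i] for i in idx], [y[i] for i in idx]))
    return trees

def predict_ensemble(trees, row):
    """Average the predictions of all trees."""
    return mean(predict_tree(t, row) for t in trees)

# Train on samples where the target mark is observed, then impute it in s3.
train = [(s, i) for s in ("s1", "s2") for i in range(N_BINS)]
X = [features(s, i) for s, i in train]
y = [signal[(s, TARGET_MARK)][i] for s, i in train]
trees = fit_ensemble(X, y)
imputed = [predict_ensemble(trees, features("s3", i)) for i in range(N_BINS)]
print([round(v, 2) for v in imputed])
```

In this toy setting the imputed track for s3 comes out high in the bins where the correlated tracks are high and low elsewhere. The sketch only shows the mechanics of training where the target mark is observed and predicting where it is not; a genome-scale method would need richer features and far more training data.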