Li Ronnie Y, Huang Yanting, Zhao Zhiyue, Qin Zhaohui S
Graduate Program in Neuroscience, Emory University, United States.
Department of Computer Science, Emory University, United States.
Data Brief. 2022 Dec 14;46:108827. doi: 10.1016/j.dib.2022.108827. eCollection 2023 Feb.
This manuscript presents a comprehensive collection of epigenomic profiling data for the human genome at 100-bp resolution with full genome-wide coverage. The datasets were processed from raw read counts produced by five types of sequencing-based assays and made available by the Encyclopedia of DNA Elements consortium (ENCODE, http://www.encodeproject.org): DNase-seq, histone ChIP-seq, TF ChIP-seq, ATAC-seq, and Poly(A) RNA-seq. Data from these high-throughput assays were processed and consolidated into a total of 6,305 genome-wide profiles. To ensure the quality of the features, we filtered out assays with low read depth, inconsistent read counts, or poor data quality. Processed data were merged by averaging read counts across technical replicates, yielding signals in about 30 million predefined 100-bp bins that tile the entire genome. We provide an example of fetching read counts at disease-related risk variants from the GWAS Catalog. Additionally, we have created a tabix index that enables fast retrieval of read counts for given coordinates in the human genome. The data processing pipeline can be reproduced and adapted by users for their own purposes and for other experimental assays. The processed data are available on Zenodo at https://zenodo.org/record/7015783. These data can be used as features for statistical and machine learning models to predict or infer a wide range of variables of biological interest. They can also be used to generate novel insights into gene expression, chromatin accessibility, and epigenetic modifications across the human genome. Finally, because the processing pipeline can be applied to data from other genome-wide profiling assays, the amount of available data can be expanded further.
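The binning and replicate-averaging steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`bin_index`, `bin_interval`, `tabix_region`, `average_replicates`) are hypothetical, and it assumes that bins start at the first base of each chromosome and use 1-based inclusive coordinates, which may differ from the layout of the files on Zenodo. The region string follows the standard `chrom:start-end` form accepted by tabix. As a sanity check on scale, a genome of roughly 3.1 billion bases divided into 100-bp bins gives about 31 million bins, consistent with the "about 30 million" stated in the abstract.

```python
from typing import List, Sequence, Tuple

BIN_SIZE = 100  # bin width in bp, as stated in the abstract


def bin_index(pos_1based: int, bin_size: int = BIN_SIZE) -> int:
    """Return the 0-based index of the bin containing a 1-based position.

    Assumption: bins tile each chromosome starting at base 1.
    """
    if pos_1based < 1:
        raise ValueError("genomic positions are 1-based and must be >= 1")
    return (pos_1based - 1) // bin_size


def bin_interval(idx: int, bin_size: int = BIN_SIZE) -> Tuple[int, int]:
    """Return the 1-based inclusive (start, end) of bin `idx`."""
    start = idx * bin_size + 1
    return start, start + bin_size - 1


def tabix_region(chrom: str, pos_1based: int, bin_size: int = BIN_SIZE) -> str:
    """Build a 'chrom:start-end' region string covering the variant's bin."""
    start, end = bin_interval(bin_index(pos_1based, bin_size), bin_size)
    return f"{chrom}:{start}-{end}"


def average_replicates(replicates: Sequence[Sequence[float]]) -> List[float]:
    """Average per-bin read counts across technical replicates.

    Each inner sequence holds the counts for one replicate over the same bins.
    """
    if not replicates:
        raise ValueError("at least one replicate is required")
    n_bins = len(replicates[0])
    if any(len(r) != n_bins for r in replicates):
        raise ValueError("all replicates must cover the same number of bins")
    n = len(replicates)
    return [sum(counts) / n for counts in zip(*replicates)]
```

For a risk variant at a hypothetical position such as `chr1:12345`, `tabix_region("chr1", 12345)` gives the region string for the enclosing 100-bp bin, which could then be passed to a tabix query against the indexed file.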