Department of Bioengineering and Bioinformatics, Moscow State University, Moscow 119992, Russia.
Institute for Information Transmission Problems, RAS, Moscow 127994, Russia.
Bioinformatics. 2017 Oct 15;33(20):3158-3165. doi: 10.1093/bioinformatics/btx379.
Genomic features with similar genome-wide distributions are generally hypothesized to be functionally related; for example, colocalization of histones and transcription start sites indicates chromatin regulation of transcription factor activity. Statistical algorithms that compute spatial, genome-wide correlation among genomic features are therefore required.
Here, we propose a method, StereoGene, that rapidly estimates genome-wide correlation among pairs of genomic features. These features may represent high-throughput data mapped to a reference genome or sets of genomic annotations in that reference genome. StereoGene correlates continuous data directly, avoiding data binarization and the resulting loss of information. Correlations are computed among neighboring genomic positions using kernel correlation. Representing the correlation as a function of genome position, StereoGene outputs a local correlation track as part of the analysis. StereoGene also accounts for confounders such as input DNA by partial correlation. We apply our method to numerous comparisons of ChIP-Seq datasets from the Human Epigenome Atlas and FANTOM CAGE to demonstrate its wide applicability. We observe changes in the correlation between epigenomic features across developmental trajectories of several tissue types that are consistent with known biology, and we find a novel spatial correlation of CAGE clusters with donor splice sites and with poly(A) sites. These analyses illustrate the broad applicability of StereoGene to regulatory genomics.
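To make the two statistical ideas named above concrete, the following is a minimal sketch, not the StereoGene implementation. It assumes that kernel correlation can be illustrated as the Pearson correlation between one track and a Gaussian-smoothed copy of the other, and that the confounder correction can be illustrated by the standard first-order partial correlation formula. The function names, the Gaussian kernel, and the parameters `sigma` and `radius` are all hypothetical choices made for illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-0.5 * (d / sigma) ** 2) for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(track, kernel):
    """Convolve a track with a symmetric kernel; positions outside the track count as zero."""
    radius = len(kernel) // 2
    n = len(track)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n:
                acc += w * track[j]
        out.append(acc)
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def kernel_correlation(f, g, sigma=2.0, radius=6):
    """Illustrative kernel correlation (an assumption, see lead-in):
    correlate f with a Gaussian-smoothed g so that nearby,
    non-overlapping signal still contributes."""
    return pearson(f, smooth(g, gaussian_kernel(sigma, radius)))


def partial_correlation(f, g, z):
    """First-order partial correlation of f and g given a confounder z
    (for example, an input DNA track); 0.0 if it is undefined."""
    r_fg = pearson(f, g)
    r_fz = pearson(f, z)
    r_gz = pearson(g, z)
    denom = math.sqrt((1 - r_fz ** 2) * (1 - r_gz ** 2))
    if denom == 0:
        return 0.0
    return (r_fg - r_fz * r_gz) / denom


# Toy tracks: two single-position peaks that lie two positions apart.
spike_f = [0.0] * 30
spike_g = [0.0] * 30
spike_f[10] = 1.0
spike_g[12] = 1.0

print("position-wise correlation:", pearson(spike_f, spike_g))
print("kernel correlation:       ", kernel_correlation(spike_f, spike_g))
```

In this toy case the position-wise correlation is negative because the two peaks never coincide, whereas the kernel-smoothed version is positive because the peaks are neighbors. This is the kind of proximity signal that a purely position-wise comparison of continuous tracks would miss.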
The StereoGene C++ source code, program documentation, Galaxy integration scripts and examples are available from the project homepage http://stereogene.bioinf.fbb.msu.ru/.
Supplementary data are available at Bioinformatics online.