Zhang Le, Dai Zichun, Yu Jun, Xiao Ming
College of Computer Science, Sichuan University, Chengdu, 610065, PR China.
Medical Big Data Center of Sichuan University, Sichuan University, Chengdu, 610065, PR China.
Brief Bioinform. 2021 Jan 18;22(1):515-525. doi: 10.1093/bib/bbz134.
Based on a review of previous CpG-related studies, we consider that the transcriptional regulation of about half of all human genes, mostly housekeeping (HK) genes, involves CpG islands (CGIs), their methylation states, CpG spacing and other chromosomal parameters. However, the precise definition of CGIs, their positioning within gene structures and the specific CGI-associated regulatory mechanisms all remain to be explained at the level of individual genes and gene families, with species and lineage specificity taken into account. Although previous studies have classified CGIs by CpG density into high-CpG (HCGI), intermediate-CpG (ICGI) and low-CpG (LCGI) classes, the correlation between CGI density and the regulation of gene expression, such as the co-regulation of HK genes by CGIs and the TATA box, remains to be elucidated. First, this study introduces a problem-solving protocol for human-genome annotation that combines GTEx, JBLA and Gene Ontology (GO) analysis. Next, we discuss why CGI-associated genes are most likely regulated by HCGI and tend to be HK genes; GO enrichment analysis shows that the HCGI/TATA± and LCGI/TATA± combinations have different GO enrichment, whereas the ICGI/TATA± combination is less characteristic. Finally, we demonstrate that the Hadoop MapReduce-based MR-JBLA algorithm is more efficient than the original JBLA in k-mer counting and CGI-associated gene analysis.
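The k-mer counting step mentioned in the abstract can be illustrated with a minimal map/reduce-style sketch in plain Python. This is not the authors' MR-JBLA implementation and does not use Hadoop; the function names (`map_kmers`, `reduce_counts`, `count_kmers`) are hypothetical, and each input chunk is assumed to be an independent sequence record, so k-mers spanning two chunks are not counted.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

VALID_BASES = set("ACGT")


def map_kmers(sequence: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step (hypothetical helper): emit (k-mer, 1) for each window of length k."""
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows that contain N or any other non-ACGT symbol.
        if set(kmer) <= VALID_BASES:
            yield kmer, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step (hypothetical helper): sum the emitted values per k-mer key."""
    counts: Counter = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


def count_kmers(chunks: List[str], k: int) -> Counter:
    """Count k-mers per chunk, then merge the partial counts into one table.

    Assumption: every chunk is a separate sequence record, so windows that
    would straddle two chunks are intentionally ignored.
    """
    total: Counter = Counter()
    for chunk in chunks:
        total.update(reduce_counts(map_kmers(chunk, k)))
    return total


if __name__ == "__main__":
    counts = count_kmers(["ACGCG", "CGNCG"], k=2)
    print(counts.most_common())
```

In a real MapReduce job the map and reduce steps would run on different nodes and the framework would group the emitted pairs by key; here the per-chunk loop only stands in for that distribution.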