Ge Steven Xijin
Department of Mathematics and Statistics, South Dakota State University, Box 2225, Brookings, SD, 57110, USA.
BMC Genomics. 2017 Feb 23;18(1):200. doi: 10.1186/s12864-017-3566-0.
Rather than testing predefined hypotheses, exploratory data analysis (EDA) aims to discover what the data can tell us. Following this strategy, we re-analyzed a large body of genomic data to study the complex gene regulation in mouse pre-implantation development (PD).
Starting with a single-cell RNA-seq dataset of 259 mouse embryonic cells spanning the zygote to blastocyst stages, we reconstructed the temporal and spatial gene expression patterns during PD. The dynamics of gene expression can be partially explained by the enrichment of transposable elements in gene promoters and by the similarity between the expression profiles of genes and those of the corresponding transposons. Long Terminal Repeats (LTRs) are associated with transient, strong induction of many nearby genes at the 2- to 4-cell stages, probably by providing binding sites for Obox and other homeobox factors. B1 and B2 SINEs (Short Interspersed Nuclear Elements) are correlated with the upregulation of thousands of nearby genes during zygotic genome activation. Similar enhancer-like effects are also found for human Alu and bovine tRNA SINEs. SINEs also appear to be predictive of gene expression in embryonic stem cells (ESCs), raising the possibility that they are also involved in regulating pluripotency. We also identified many potential transcription factors underlying PD and discussed the evolutionary necessity of transposons in enhancing genetic diversity, especially for species with longer generation times.
Together with other recent studies, our results provide further evidence that many transposable elements may play a role in establishing the expression landscape in early embryos. They also demonstrate that exploratory bioinformatics investigation can pinpoint developmental pathways for further study and can serve as a strategy for generating novel insights from big genomic data.
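The enrichment of a transposable element family in the promoters of a co-expressed gene cluster, as described above, can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric test, which is a common choice for this kind of question; the abstract does not state which statistic was actually used. The function name `hypergeom_enrichment_p` and all gene counts in the usage lines are hypothetical and serve only as an illustration.

```python
from math import comb


def hypergeom_enrichment_p(n_genes, n_with_te, n_cluster, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_genes   : total genes in the background set
    n_with_te : background genes whose promoter carries the element
    n_cluster : genes in the co-expressed cluster being tested
    n_overlap : cluster genes whose promoter carries the element
    """
    max_overlap = min(n_with_te, n_cluster)
    total = comb(n_genes, n_cluster)
    # Sum the probability mass from the observed overlap up to the maximum.
    tail = sum(
        comb(n_with_te, k) * comb(n_genes - n_with_te, n_cluster - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only (not taken from the study):
# 20,000 background genes, 3,000 with the element in their promoter,
# a cluster of 500 genes, 120 of which carry the element.
p_value = hypergeom_enrichment_p(20000, 3000, 500, 120)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice such a test would be repeated for every element family and gene cluster, so the resulting p-values would need a multiple-testing correction before interpretation.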