Mongin Emmanuel, Dewar Ken, Blanchette Mathieu
McGill Centre for Bioinformatics, McGill University, Montreal, Canada.
BMC Evol Biol. 2009 Aug 15;9:203. doi: 10.1186/1471-2148-9-203.
The availability of newly sequenced vertebrate genomes, along with more efficient and accurate alignment algorithms, has enabled the expansion of the field of comparative genomics. Large-scale genome rearrangement events modify the order of genes and non-coding conserved regions on chromosomes. While certain large genomic regions have remained intact over much of vertebrate evolution, others appear to be hotspots for genomic breakpoints. The cause of this non-uniform distribution of breakpoints during vertebrate evolution is poorly understood.
We describe a machine learning method to distinguish genomic regions where breakpoints would be expected to have deleterious effects (called breakpoint-refractory regions) from those where they are expected to be neutral (called breakpoint-susceptible regions). Our predictor is trained using breakpoints that took place along the human lineage since amniote divergence. Based on our predictions, refractory and susceptible regions have very distinctive features. Refractory regions are significantly enriched for conserved non-coding elements as well as for genes involved in development, whereas susceptible regions are enriched for housekeeping genes, likely to have simpler transcriptional regulation.
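The classification task described above can be illustrated with a small sketch. The abstract does not specify the learning algorithm or the feature set, so everything here is an assumption made for illustration: logistic regression as the classifier, the two hypothetical features `cne_density` and `housekeeping_fraction`, and the synthetic data, which is generated to mimic the reported trend (refractory regions richer in conserved non-coding elements, susceptible regions richer in housekeeping genes).

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_regions(n, rng):
    """Build synthetic genomic windows (hypothetical data, not from the paper).

    Each item is ((cne_density, housekeeping_fraction), label),
    where label 1 = breakpoint-susceptible and 0 = breakpoint-refractory.
    """
    regions = []
    for _ in range(n):
        susceptible = rng.random() < 0.5
        if susceptible:
            cne = rng.gauss(0.2, 0.1)    # few conserved non-coding elements
            hk = rng.gauss(0.6, 0.15)    # many housekeeping genes
        else:
            cne = rng.gauss(0.7, 0.1)    # many conserved non-coding elements
            hk = rng.gauss(0.3, 0.15)    # few housekeeping genes
        regions.append(((cne, hk), 1 if susceptible else 0))
    return regions


def train_logistic(data, epochs=500, lr=0.5):
    """Fit a two-feature logistic regression by full-batch gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        gw = [0.0, 0.0]
        gb = 0.0
        for x, y in data:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w[0] -= lr * gw[0] / n
        w[1] -= lr * gw[1] / n
        b -= lr * gb / n
    return w, b


def predict_susceptible(w, b, x):
    """Return the estimated probability that a window is breakpoint-susceptible."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


rng = random.Random(0)
train = make_toy_regions(400, rng)
w, b = train_logistic(train)
accuracy = sum(
    (predict_susceptible(w, b, x) >= 0.5) == bool(y) for x, y in train
) / len(train)
print("weights:", w, "bias:", b, "training accuracy:", accuracy)
```

On this toy data the fitted weight for `cne_density` is negative and the weight for `housekeeping_fraction` is positive, which matches the direction of the enrichment reported above. It says nothing about the actual predictor or its performance.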
We postulate that long-range transcriptional regulation strongly influences chromosome break fixation. In many regions, the fitness cost of altering the spatial association between long-range regulatory regions and their target genes may be so high that rearrangements are not allowed. Consequently, only a limited, identifiable fraction of the genome is susceptible to genome rearrangements.