Margulies Elliott H, Blanchette Mathieu, Haussler David, Green Eric D
Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
Genome Res. 2003 Dec;13(12):2507-18. doi: 10.1101/gr.1602203.
Comparative sequence analysis has become an essential component of studies aiming to elucidate genome function. The increasing availability of genomic sequences from multiple vertebrates is creating the need for computational methods that can detect highly conserved regions in a robust fashion. Towards that end, we are developing approaches for identifying sequences that are conserved across multiple species; we call these "Multi-species Conserved Sequences" (or MCSs). Here we report two strategies for MCS identification, demonstrating their ability to detect virtually all known actively conserved sequences (specifically, coding sequences) but very little neutrally evolving sequence (specifically, ancestral repeats). Importantly, we find that a substantial fraction of the bases within MCSs (approximately 70%) resides within non-coding regions; thus, the majority of sequences conserved across multiple vertebrate species has no known function. Initial characterization of these MCSs has revealed sequences that correspond to clusters of transcription factor-binding sites, non-coding RNA transcripts, and other candidate functional elements. Finally, the ability to detect MCSs represents a valuable metric for assessing the relative contribution of a species' sequence to identifying genomic regions of interest, and our results indicate that the currently available genome sequences are insufficient for the comprehensive identification of MCSs in the human genome.