Rigoutsos Isidore, Huynh Tien, Miranda Kevin, Tsirigos Aristotelis, McHardy Alice, Platt Daniel
IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA.
Proc Natl Acad Sci U S A. 2006 Apr 25;103(17):6605-10. doi: 10.1073/pnas.0601688103. Epub 2006 Apr 24.
Using an unsupervised pattern-discovery method, we processed the human intergenic and intronic regions and catalogued all variable-length patterns with identically conserved copies and multiplicities above what is expected by chance. Among the millions of discovered patterns, we found a subset of 127,998 patterns, termed pyknons, which have additional nonoverlapping instances in the untranslated and protein-coding regions of 30,675 transcripts from 20,059 human genes. The pyknons arrange combinatorially in the untranslated and coding regions of numerous human genes where they form mosaics. Consecutive instances of pyknons in these regions show a strong bias in their relative placement, favoring distances of approximately 22 nucleotides. We also found pyknons to be enriched in a statistically significant manner in genes involved in specific processes, e.g., cell communication, transcription, regulation of transcription, signaling, transport, etc. For approximately 1/3 of the pyknons, the intergenic/intronic instances of their reverse complement lie within 380,084 nonoverlapping regions, typically 60-80 nucleotides long, which are predicted to form double-stranded, energetically stable, hairpin-shaped RNA secondary structures; additionally, the pyknons subsume approximately 40% of the known microRNA sequences, thus suggesting a possible link with posttranscriptional gene silencing and RNA interference. Cross-genome comparisons reveal that many of the pyknons have instances in the 3' UTRs of genes from other vertebrates and invertebrates where they are overrepresented in similar biological processes, as in the human genome. These unexpected findings suggest potential unique functional connections between the coding and noncoding parts of the human genome.
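Three of the operations named in the abstract (locating nonoverlapping instances of a pattern, measuring the spacing between consecutive instances, and taking a reverse complement) can be illustrated with a minimal Python sketch. This is not the authors' pattern-discovery method; the function names, the toy sequence, and the choice to measure spacing as start-to-start offsets are all assumptions made for illustration.

```python
# Illustrative sketch only; hypothetical helper names, not the authors' pipeline.

# Translation table for complementing DNA bases (assumes an uppercase ACGT alphabet).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def nonoverlapping_instances(pattern, sequence):
    """Return start positions of left-to-right, nonoverlapping exact matches."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume the search after the end of this match so instances never overlap.
        start = sequence.find(pattern, start + len(pattern))
    return positions


def consecutive_distances(positions):
    """Start-to-start offsets between consecutive instances (an assumed definition)."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy data: three copies of a made-up 7-nt pattern whose starts are 22 nt apart.
pattern = "GATTACA"
sequence = pattern + "T" * 15 + pattern + "T" * 15 + pattern

positions = nonoverlapping_instances(pattern, sequence)
distances = consecutive_distances(positions)
rc = reverse_complement(pattern)
```

On this toy input the pattern starts at offsets 0, 22, and 44. A real analysis would scan annotated untranslated and coding regions rather than a synthetic string, and would compare the resulting distance histogram against a chance expectation.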