Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken, 526-0829, Japan.
Chromosome Res. 2013 Aug;21(5):461-74. doi: 10.1007/s10577-013-9371-y. Epub 2013 Jul 30.
Because oligonucleotide composition in genome sequences varies significantly among species, even among those with the same genomic G + C%, it has been used to distinguish a wide range of genomes and is called a "genome signature". Oligonucleotides often represent motif sequences responsible for sequence-specific protein binding (e.g., transcription-factor binding). Occurrences of such motif oligonucleotides in a genome should be biased relative to those in random sequences and may differ among genomes and among genomic regions. The Self-Organizing Map (SOM) is a powerful tool for clustering high-dimensional data, such as oligonucleotide composition, on a single plane. We previously modified the conventional SOM for genome informatics to develop the batch-learning SOM, or "BLSOM". When we constructed BLSOMs to analyze pentanucleotide composition in 20-, 50-, and 100-kb sequences derived from the human genome, the BLSOMs did not classify human sequences according to chromosome but revealed several specific zones composed primarily of sequences derived from pericentric regions. Interestingly, various transcription-factor-binding motifs were characteristically overrepresented in pericentric regions but underrepresented in most genomic sequences. When we focused on much shorter sequences (e.g., 1 kb), the clustering of transcription-factor-binding motifs was evident in pericentric, subtelomeric, and sex chromosome pseudoautosomal regions. The biological significance of the clustering in these regions is discussed in connection with cell-type- and stage-dependent chromocenter formation and nuclear organization.
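As an illustration of the input such an analysis works on, the following minimal sketch computes pentanucleotide composition vectors for fixed-size, non-overlapping windows of a sequence. It is not the authors' implementation: the function names `pentanucleotide_frequencies` and `window_compositions`, the rule of skipping 5-mers that contain ambiguous bases such as N, and the normalization to relative frequencies are all assumptions made for this example. Each window yields a 1,024-dimensional vector (4^5 possible pentanucleotides), the kind of high-dimensional data a SOM would then cluster.

```python
from itertools import product

K = 5  # pentanucleotides
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 4**5 = 1024 k-mers
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def pentanucleotide_frequencies(seq):
    """Return a 1024-element list of relative 5-mer frequencies for seq.

    Assumption for this sketch: 5-mers containing any character other
    than A, C, G, or T (e.g. N) are skipped rather than counted.
    """
    counts = [0] * len(KMERS)
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        j = INDEX.get(seq[i:i + K])
        if j is not None:
            counts[j] += 1
            total += 1
    if total == 0:
        # No valid 5-mer in this sequence; return an all-zero vector.
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def window_compositions(seq, window):
    """Split seq into non-overlapping windows of length `window`
    (a trailing partial window is dropped) and return one
    composition vector per window."""
    return [
        pentanucleotide_frequencies(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]
```

For the window sizes mentioned in the abstract, `window` would be set to 20000, 50000, 100000, or 1000; the resulting vectors could then be passed to any SOM implementation.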