Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain.
J Theor Biol. 2012 Mar 21;297:127-36. doi: 10.1016/j.jtbi.2011.12.024. Epub 2011 Dec 30.
Relevant words in literary texts (key words) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text par excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k=2-9 bp) in human and mouse chromosome sequences and then check whether clustered words are enriched in the functional part of the genome. First, we found a generally positive trend between clustering level and word enrichment within exons and Transcription Factor Binding Sites (TFBSs), while a much weaker relation exists for repeats, and no relation at all exists for introns. Second, we found that 38.45% of the 200 top-clustered 8-mers, but only 7.70% of the non-clustered words, are represented in known motif databases. Third, enrichment/depletion experiments show that highly clustered words are significantly enriched in exons and TFBSs, while they are depleted in introns and repetitive DNA. Considering exons and TFBSs together, 1417 (72.26%) of the top-clustered 8-mers in human and 1385 (72.97%) in mouse showed a statistically significant association with either exons or TFBSs, strongly supporting the link between word clustering and biological function. Lastly, we identified a subset of clustered, diagnostic words that are enriched in exons but depleted in introns, and therefore might help to discriminate between these two gene regions. The clustering of DNA words thus emerges as a novel principle for detecting functionality in genome sequences.
As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and to improve gene and promoter predictions, thus contributing to the quest for function in the genome.
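
The abstract does not state the formula used for the clustering coefficient. As an illustration only, the sketch below assumes one plausible score: the coefficient of variation of the distances between consecutive occurrences of a k-mer, divided by sqrt(1 - p), the value expected when gaps follow a geometric (random) distribution with per-position occurrence probability p. The function names `kmer_positions` and `clustering_score` are hypothetical, and this is not necessarily the measure used in the study. Under this assumption, scores near 1 suggest random placement, scores well above 1 suggest clustering, and scores near 0 suggest regular spacing.

```python
import math
from statistics import mean, pstdev


def kmer_positions(seq, kmer):
    """Return the start index of every (possibly overlapping) occurrence."""
    positions = []
    start = seq.find(kmer)
    while start != -1:
        positions.append(start)
        start = seq.find(kmer, start + 1)
    return positions


def clustering_score(seq, kmer):
    """Assumed clustering score: gap coefficient of variation / sqrt(1 - p).

    Returns None when the k-mer occurs fewer than three times (too few gaps)
    or when it occupies every position (p == 1).
    """
    positions = kmer_positions(seq, kmer)
    if len(positions) < 3:
        return None
    # Distances between consecutive occurrences along the sequence.
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    # Estimated per-position occurrence probability of the word.
    p = len(positions) / len(seq)
    if p >= 1.0:
        return None
    cv = pstdev(gaps) / mean(gaps)
    # Normalize by the coefficient of variation of a geometric distribution.
    return cv / math.sqrt(1.0 - p)


# Toy sequences (not real genomic data): one regular, one with two clusters.
regular = "ACGTTTTT" * 10
clustered = "ACGT" * 5 + "T" * 200 + "ACGT" * 5
print(clustering_score(regular, "ACGT"))
print(clustering_score(clustered, "ACGT"))
```

On these toy inputs the regularly spaced word scores 0 because all gaps are equal, while the word concentrated in two tight groups scores above 1. A real analysis would run such a score over every k-mer on a whole chromosome and would need a significance criterion, which this sketch omits.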