Bejerano Gill, Haussler David, Blanchette Mathieu
Center for Biomolecular Science and Engineering, Baskin School of Engineering, University of California in Santa Cruz, Santa Cruz, CA 95064, USA.
Bioinformatics. 2004 Aug 4;20 Suppl 1:i40-8. doi: 10.1093/bioinformatics/bth946.
It is currently believed that the human genome contains about twice as many non-coding functional regions as it does protein-coding genes, yet our understanding of these regions is very limited.
We examine the intersection between syntenically conserved sequences in the human, mouse and rat genomes, and sequence similarities within the human genome itself, in search of families of non-protein-coding elements. For this purpose we develop a graph-theoretic clustering algorithm, akin to the highly successful methods used in elucidating protein sequence family relationships. The algorithm is applied to a highly filtered set of about 700 000 human-rodent evolutionarily conserved regions, none resembling any known coding sequence, which together encompass 3.7% of the human genome. From these, we obtain roughly 12 000 non-singleton clusters, dense in significant sequence similarities. Further analysis of genomic location, evidence of transcription and RNA secondary structure reveals many clusters to be significantly homogeneous in one or more characteristics. This subset of the highly conserved non-protein-coding elements in the human genome thus contains rich family-like structures, which merit in-depth analysis.
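The abstract does not spell out the clustering procedure, so the following is only an illustrative sketch of the general idea of graph-based sequence clustering, not the authors' algorithm. It assumes regions are numbered from 0, that pairwise similarity hits arrive as hypothetical `(region_a, region_b, evalue)` tuples, and that a hit counts as significant below an arbitrary `max_evalue` cutoff. Clusters are taken to be connected components of the resulting graph, and each non-singleton cluster is reported with its edge density (observed significant pairs divided by all possible pairs).

```python
from collections import defaultdict


def cluster_regions(n_regions, hits, max_evalue=1e-3):
    """Group regions into connected components of a similarity graph.

    hits: iterable of (region_a, region_b, evalue) tuples (assumed format).
    Returns a sorted list of (member_ids, edge_density) for every
    non-singleton cluster.
    """
    parent = list(range(n_regions))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Keep each significant pair once, ignoring self-hits.
    edges = set()
    for a, b, evalue in hits:
        if a != b and evalue <= max_evalue:
            edges.add((min(a, b), max(a, b)))
            parent[find(a)] = find(b)

    # Collect the members and the edge count of each component.
    members = defaultdict(list)
    for region in range(n_regions):
        members[find(region)].append(region)
    edge_count = defaultdict(int)
    for a, _ in edges:
        edge_count[find(a)] += 1

    clusters = []
    for root, nodes in members.items():
        if len(nodes) < 2:
            continue  # singletons are discarded
        possible_pairs = len(nodes) * (len(nodes) - 1) // 2
        clusters.append((sorted(nodes), edge_count[root] / possible_pairs))
    return sorted(clusters)


# Toy input: nine regions, one hit (4, 5) too weak to count.
example_hits = [
    (0, 1, 1e-10), (1, 2, 1e-8), (0, 2, 1e-6),
    (3, 4, 1e-5), (4, 5, 0.5),
    (6, 7, 1e-9), (7, 8, 1e-7),
]
example_clusters = cluster_regions(9, example_hits)
for nodes, density in example_clusters:
    print(nodes, round(density, 2))
```

In this toy input, regions 0 to 2 form a fully connected cluster, while regions 6 to 8 are linked only as a chain, so their density is lower. A density threshold on such components would be one simple way to retain only clusters that are "dense in significant sequence similarities".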
Supplementary material to this work is available at http://www.soe.ucsc.edu/~jill/dark.html