Into the heart of darkness: large-scale clustering of human non-coding DNA.

Authors

Bejerano Gill, Haussler David, Blanchette Mathieu

Affiliation

Center for Biomolecular Science and Engineering, Baskin School of Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064, USA.

Publication

Bioinformatics. 2004 Aug 4;20 Suppl 1:i40-8. doi: 10.1093/bioinformatics/bth946.

Abstract

MOTIVATION

It is currently believed that the human genome contains about twice as many non-coding functional regions as protein-coding genes, yet our understanding of these regions is very limited.

RESULTS

We examine the intersection between syntenically conserved sequences in the human, mouse and rat genomes, and sequence similarities within the human genome itself, in search of families of non-protein-coding elements. For this purpose we develop a graph-theoretic clustering algorithm, akin to the highly successful methods used in elucidating protein sequence family relationships. The algorithm is applied to a highly filtered set of about 700 000 human-rodent evolutionarily conserved regions that do not resemble any known coding sequence and together encompass 3.7% of the human genome. From these, we obtain roughly 12 000 non-singleton clusters, dense in significant sequence similarities. Further analysis of genomic location, evidence of transcription and RNA secondary structure reveals many clusters to be significantly homogeneous in one or more characteristics. This subset of the highly conserved non-protein-coding elements in the human genome thus contains rich family-like structures, which merit in-depth analysis.
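
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea of graph-based sequence clustering, not the authors' algorithm. It assumes that conserved regions are numbered from 0, that pairwise similarity hits arrive as hypothetical `(region_a, region_b, evalue)` tuples, and that a hit counts as significant below an illustrative `max_evalue` cutoff. Regions are grouped by connected components (single linkage, via union-find), singletons are dropped, and each cluster reports the fraction of its possible pairs that are linked by a significant hit.

```python
from collections import defaultdict


def cluster_by_similarity(n_regions, hits, max_evalue=1e-5):
    """Group regions into non-singleton clusters of a similarity graph.

    hits: iterable of (region_a, region_b, evalue) tuples (assumed format).
    Returns a sorted list of (members, density) pairs, where density is the
    fraction of possible member pairs joined by a significant hit.
    """
    parent = list(range(n_regions))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Keep each significant, non-self hit once as an undirected edge.
    edges = set()
    for a, b, evalue in hits:
        if a != b and evalue <= max_evalue:
            edges.add((min(a, b), max(a, b)))
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    # Collect members and edge counts per connected component.
    members = defaultdict(list)
    for region in range(n_regions):
        members[find(region)].append(region)
    edge_count = defaultdict(int)
    for a, _ in edges:
        edge_count[find(a)] += 1

    clusters = []
    for root, group in members.items():
        if len(group) < 2:
            continue  # singletons are not reported
        possible_pairs = len(group) * (len(group) - 1) // 2
        clusters.append((sorted(group), edge_count[root] / possible_pairs))
    return sorted(clusters)


# Toy input: eight regions and six pairwise hits (made-up values).
example_hits = [
    (0, 1, 1e-10), (1, 2, 1e-8), (0, 2, 1e-20),
    (3, 4, 1e-6), (4, 5, 1e-7), (5, 6, 0.5),
]
for cluster_members, density in cluster_by_similarity(8, example_hits):
    print(cluster_members, round(density, 2))
```

Connected components are the simplest possible choice and tend to merge families through a single spurious hit; a density threshold or a stricter linkage rule would be a natural refinement when the goal is clusters that are dense in significant similarities.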

AVAILABILITY

Supplementary material to this work is available at http://www.soe.ucsc.edu/~jill/dark.html
