Rozowsky Joel S, Newburger Daniel, Sayward Fred, Wu Jiaqian, Jordan Greg, Korbel Jan O, Nagalakshmi Ugrappa, Yang Jin, Zheng Deyou, Guigó Roderic, Gingeras Thomas R, Weissman Sherman, Miller Perry, Snyder Michael, Gerstein Mark B
Molecular Biophysics and Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA.
Genome Res. 2007 Jun;17(6):732-45. doi: 10.1101/gr.5696007.
For the approximately 1% of the human genome in the ENCODE regions, only about half of the transcriptionally active regions (TARs) identified with tiling microarrays correspond to annotated exons. Here we categorize this large amount of "unannotated transcription." We use a number of disparate features to classify the 6988 novel TARs: array expression profiles across cell lines and conditions, sequence composition, phylogenetic profiles (presence/absence of syntenic conservation across 17 species), and locations relative to genes. In the classification, we first filter out TARs with unusual sequence composition and those likely resulting from cross-hybridization. We then associate some of those remaining with proximal exons having correlated expression profiles. Finally, we cluster unclassified TARs into putative novel loci, based on similar expression and phylogenetic profiles. To encapsulate our classification, we construct a Database of Active Regions and Tools (DART.gersteinlab.org). DART has special facilities for rapidly handling and comparing many sets of TARs and their heterogeneous features, synchronizing across builds, and interfacing with other resources. Overall, we find that approximately 14% of the novel TARs can be associated with known genes, while approximately 21% can be clustered into approximately 200 novel loci. We observe that TARs associated with genes are enriched in the potential to form structural RNAs and many novel TAR clusters are associated with nearby promoters. To benchmark our classification, we design a set of experiments for testing the connectivity of novel TARs. Overall, we find that 18 of the 46 connections tested validate by RT-PCR and four of five sequenced PCR products confirm connectivity unambiguously.
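The association step in the abstract (linking a novel TAR to a proximal exon with a correlated expression profile) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the correlation measure, the distance window, or any cutoff, so the use of Pearson correlation, the `max_distance` and `min_r` parameters, the dictionary layout, and the function names are all assumptions made for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(vx * vy)


def associate_tar(tar, exons, max_distance=10_000, min_r=0.8):
    """Return the gene of the best-correlated nearby exon, or None.

    max_distance (bp) and min_r are hypothetical cutoffs, not values
    taken from the paper.
    """
    best_gene, best_r = None, min_r
    for exon in exons:
        # Gap between the two intervals; 0 if they overlap.
        gap = max(exon["start"] - tar["end"], tar["start"] - exon["end"], 0)
        if gap > max_distance:
            continue  # not proximal
        r = pearson(tar["profile"], exon["profile"])
        if r >= best_r:
            best_gene, best_r = exon["gene"], r
    return best_gene


# Toy data: profiles are expression values across four conditions.
tar = {"start": 1000, "end": 1100, "profile": [1, 2, 3, 4]}
exons = [
    {"gene": "geneA", "start": 3000, "end": 3200, "profile": [2, 4, 6, 8]},
    {"gene": "geneB", "start": 500000, "end": 500100, "profile": [1, 2, 3, 4]},
    {"gene": "geneC", "start": 1500, "end": 1600, "profile": [4, 3, 2, 1]},
]
print(associate_tar(tar, exons))
```

In this toy data, `geneB` is perfectly correlated but lies outside the assumed distance window, and `geneC` is nearby but anti-correlated, so only `geneA` satisfies both criteria. TARs for which the function returns `None` would correspond to those left for the later clustering step.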