Comprehensive discovery of DNA motifs in 349 human cells and tissues reveals new features of motifs.

Authors

Zheng Yiyu, Li Xiaoman, Hu Haiyan

Affiliations

Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA.

Burnett School of Biomedical Science, College of Medicine, University of Central Florida, Orlando, FL 32816, USA.

Publication information

Nucleic Acids Res. 2015 Jan;43(1):74-83. doi: 10.1093/nar/gku1261. Epub 2014 Dec 10.

DOI:10.1093/nar/gku1261

PMID:25505144

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4288161/

Abstract

Comprehensive motif discovery under experimental conditions is critical for the global understanding of gene regulation. To generate a nearly complete list of human DNA motifs under given conditions, we employed a novel approach for de novo discovery of significant co-occurring DNA motifs in 349 human DNase I hypersensitive site datasets. We predicted 845 to 1325 motifs in each dataset, for a total of 2684 non-redundant motifs. These 2684 motifs contained 54.02 to 75.95% of the known motifs in seven large collections, including TRANSFAC. In each dataset, we also discovered 43 663 to 2 013 288 motif modules, which are groups of motifs whose binding sites co-occur in a significant number of short DNA regions. Compared with known interacting transcription factors in eight resources, the predicted motif modules on average included 84.23% of known interacting motifs. We further showed new features of the predicted motifs: for example, motifs enriched in proximal regions rarely overlapped with motifs enriched in distal regions, and motifs enriched in 5' distal regions were often also enriched in 3' distal regions. Finally, we observed that the 2684 predicted motifs classified the cell or tissue types of the datasets with an accuracy of 81.29%. The resources generated in this study are available at http://server.cs.ucf.edu/predrem/.
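As an illustration of what "binding sites co-occurring in a significant number of short DNA regions" can mean in practice, the following minimal sketch scores motif pairs with a one-sided hypergeometric test. It is not the authors' method, which the abstract does not describe. The function names (`hypergeom_sf`, `cooccurring_pairs`), the input layout (one set of motif names per region), and the `alpha` threshold are all assumptions made for this example.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric(N items, K marked, n drawn)."""
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible terms drop out of the sum automatically.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def cooccurring_pairs(region_motifs, alpha=0.05):
    """Find motif pairs that share regions more often than chance predicts.

    region_motifs: list of sets; each set names the motifs that have at
    least one predicted site in that region (hypothetical input format).
    Returns {(motif_a, motif_b): (shared_region_count, p_value)}.
    """
    n_regions = len(region_motifs)
    motif_counts = {}
    pair_counts = {}
    for motifs in region_motifs:
        for m in motifs:
            motif_counts[m] = motif_counts.get(m, 0) + 1
        # Sort so each unordered pair has a single canonical key.
        for pair in combinations(sorted(motifs), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

    significant = {}
    for (a, b), shared in pair_counts.items():
        p = hypergeom_sf(shared, n_regions, motif_counts[a], motif_counts[b])
        if p <= alpha:
            significant[(a, b)] = (shared, p)
    return significant


if __name__ == "__main__":
    # Toy data: motifs A and B always appear together; C appears alone.
    regions = [{"A", "B"}] * 8 + [{"C"}] * 12
    for pair, (shared, p) in cooccurring_pairs(regions).items():
        print(pair, shared, p)
```

The sketch handles pairs only and applies no multiple-testing correction. With tens of thousands to millions of candidate modules per dataset, as reported in the abstract, a real analysis would need such a correction and a way to extend pairs to larger motif groups.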
