Triska Martin, Solovyev Victor, Baranova Ancha, Kel Alexander, Tatarinova Tatiana V
Children's Hospital Los Angeles, University of Southern California, Los Angeles, CA, United States of America.
Faculty of Advanced Technology, University of South Wales, Pontypridd, Wales, United Kingdom.
PLoS One. 2017 Nov 15;12(11):e0187243. doi: 10.1371/journal.pone.0187243. eCollection 2017.
Computational analysis of promoters is hindered by the complexity of their architecture. In less-studied genomes with complex organization, false-positive promoter predictions are common. Accurate identification of transcription start sites (TSSs) and core promoter regions remains an unsolved problem. In this paper, we present a comprehensive analysis of genomic features associated with promoters and show that models driven by probabilistic integrative algorithms allow accurate classification of DNA sequences into "promoters" and "non-promoters" even in the absence of full-length cDNA sequences. These models may be built upon maps of the distributions of sequence polymorphisms, RNA sequencing reads on genomic DNA, methylated nucleotides, and transcription factor binding sites, as well as the relative frequencies of nucleotides and their combinations. Positional clustering of binding sites shows that the cells of Oryza sativa utilize three distinct classes of transcription factors (TFs): those that bind preferentially to the [-500,0] region (188 "promoter-specific" TFs), those that bind preferentially to the [0,500] region (282 "5' UTR-specific" TFs), and 207 "promiscuous" TFs with little or no positional preference with respect to the TSS. For the most informative motifs, positional preferences are conserved between dicots and monocots.
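The three-way grouping of transcription factors by binding position can be illustrated with a minimal sketch. The abstract does not describe the clustering procedure, so this is not the authors' method: the function name `classify_tf`, the fraction threshold, the minimum site count, and the choice to count position 0 as downstream are all assumptions made for illustration. Positions are binding-site coordinates relative to the TSS at 0.

```python
def classify_tf(positions, threshold=0.65, min_sites=10):
    """Label a TF by where its binding sites fall relative to the TSS (position 0).

    Hypothetical rule: a TF is "specific" to a region if at least `threshold`
    of its sites within [-500, 500] fall in that region. Both cutoffs are
    illustrative assumptions, not values taken from the paper.
    """
    # Sites in the upstream window [-500, 0); position 0 is counted downstream.
    upstream = sum(1 for p in positions if -500 <= p < 0)
    # Sites in the downstream window [0, 500].
    downstream = sum(1 for p in positions if 0 <= p <= 500)
    total = upstream + downstream
    if total < min_sites:
        return "insufficient data"
    if upstream / total >= threshold:
        return "promoter-specific"
    if downstream / total >= threshold:
        return "5' UTR-specific"
    return "promiscuous"


# Toy input: made-up positions for hypothetical TFs, not data from the study.
toy_sites = {
    "TF_A": list(range(-400, -100, 20)) + [50, 120],  # mostly upstream
    "TF_B": list(range(20, 320, 20)) + [-50],         # mostly downstream
    "TF_C": list(range(-300, 300, 40)),               # spread on both sides
}
labels = {name: classify_tf(pos) for name, pos in toy_sites.items()}
```

A fixed fraction threshold is the simplest possible rule. A real analysis would more likely test each TF's upstream/downstream counts against a background expectation and correct for multiple testing.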