Narang Vipin, Sung Wing-Kin, Mittal Ankush
Department of Computer Science, S16 #06-02, 3 Science Drive 2, National University of Singapore, Singapore 117543, Singapore.
Artif Intell Med. 2005 Sep-Oct;35(1-2):107-19. doi: 10.1016/j.artmed.2005.02.005.
The gene promoter region controls transcriptional initiation of a gene, which is the most important step in gene regulation. In-silico detection of promoter regions in genomic sequences has a number of applications in gene discovery and in understanding the regulation of gene expression. However, computational prediction of eukaryotic pol-II (RNA polymerase II) promoters has remained a difficult task. This paper introduces a novel statistical technique for detecting promoter regions in long genomic sequences.
A number of existing techniques analyze the occurrence frequencies of oligonucleotides in promoter sequences as compared to other genomic regions. In contrast, the present work studies the positional densities of oligonucleotides in promoter sequences. The analysis does not require any non-promoter sequence dataset or any model of the background oligonucleotide content of the genome. The statistical model learnt from a dataset of promoter sequences automatically recognizes a number of transcription factor binding sites simultaneously with their occurrence positions relative to the transcription start site. Based on this model, a continuous naïve Bayes classifier is developed for the detection of human promoters and transcription start sites in genomic sequences.
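The idea of scoring a sequence by where oligonucleotides occur, rather than how often, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the k-mer length, the Gaussian smoothing bandwidth, the pseudocount, and the use of a uniform positional density as the reference are all assumptions made for illustration. The sketch learns a smoothed positional density for each k-mer from promoter sequences aligned on the transcription start site, then scores a candidate window by summing log density ratios under the naive Bayes independence assumption. Only promoter sequences are used for training.

```python
import math
from collections import defaultdict


def learn_positional_densities(promoters, k=3, bandwidth=2.0, pseudocount=1e-3):
    """Estimate, for each k-mer, a density over positions in aligned promoters.

    `promoters` is a list of equal-length strings aligned on the TSS
    (an assumption of this sketch). Returns {kmer: [density per position]}.
    """
    n_pos = len(promoters[0]) - k + 1
    counts = defaultdict(lambda: [0.0] * n_pos)
    for seq in promoters:
        for i in range(n_pos):
            counts[seq[i:i + k]][i] += 1.0

    # Truncated Gaussian kernel used to smooth the raw position histogram.
    radius = int(3 * bandwidth)
    kernel = [math.exp(-0.5 * (d / bandwidth) ** 2)
              for d in range(-radius, radius + 1)]

    densities = {}
    for kmer, hist in counts.items():
        smooth = [pseudocount] * n_pos  # keeps every position non-zero
        for i, c in enumerate(hist):
            if c == 0.0:
                continue
            for j, w in enumerate(kernel):
                p = i + j - radius
                if 0 <= p < n_pos:
                    smooth[p] += c * w
        total = sum(smooth)
        densities[kmer] = [s / total for s in smooth]
    return densities


def score_window(window, densities, k=3):
    """Sum of log(positional density / uniform density) over the window's k-mers.

    K-mers never seen in training contribute nothing (a simplifying assumption).
    """
    n_pos = len(window) - k + 1
    uniform = 1.0 / n_pos
    score = 0.0
    for i in range(n_pos):
        dens = densities.get(window[i:i + k])
        if dens is not None:
            score += math.log(dens[i] / uniform)
    return score


# Toy data (made up for illustration): a TATA-like motif at a fixed offset.
toy_promoters = [
    "GCGCTATAAGCGCGCGC",
    "CGCGTATAACGCGCGCG",
    "GGCCTATAAGGCCGGCC",
    "CCGGTATAACCGGCCGG",
]
densities = learn_positional_densities(toy_promoters)
aligned_score = score_window("GCGCTATAAGCGCGCGC", densities)
shifted_score = score_window("GCGCGCGCTATAAGCGC", densities)  # same motif, displaced
```

In this toy setting the window with the motif at its usual offset scores higher than the window containing the same motif at a displaced position, which is the behavior a frequency-only model would not capture. Scanning a long genomic sequence would amount to sliding such a window along it and reporting high-scoring positions as candidate transcription start sites.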
The present study extends the scope of statistical models in general promoter modeling and prediction. Promoter sequence features learnt by the model correlate well with known biological facts. Results of human transcription start site prediction compare favorably with existing second-generation promoter prediction tools.