Hutchinson G B
Department of Medical Genetics, University of British Columbia, Vancouver, Canada.
Comput Appl Biosci. 1996 Oct;12(5):391-8. doi: 10.1093/bioinformatics/12.5.391.
To develop an algorithm utilizing differential hexamer frequency analysis to discriminate promoter from non-promoter regions in vertebrate DNA sequence, without relying upon an extensive database of known transcriptional elements.
Using hexamer frequencies derived from known promoter regions, coding regions and non-coding regions in vertebrate DNA sequence, together with a formula first applied by Claverie and Bougueleret (1986), a discriminant measure was created that compares promoter regions with coding (D1) and non-coding (D2) sequence. The algorithm correctly identifies the promoter regions in 18 of 29 loci (62.1%) from an independent test data set. With program options set to identify only one promoter region in the forward strand, there are 11 false-positive predictions in 208 714 nucleotides (one false positive in 18 974 single-stranded bp). With options set to analyze sequence in discrete segments, there is no appreciable improvement in sensitivity, whereas the specificity falls off predictably. It is of particular interest that a search for a peak score (independent of an absolute threshold) is more accurate than a search based upon a fixed scoring threshold. This suggests that the selection of promoter sites may be influenced by the global properties of an entire sequence domain, rather than exclusively by local characteristics.
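The differential hexamer scoring and the peak-versus-threshold search described above can be illustrated with a minimal sketch. It is not the published program: the abstract does not state the Claverie and Bougueleret formula, so `window_score` uses a simple stand-in (the mean of f_promoter / (f_promoter + f_background) over the hexamers in a window), and the function names, the pseudocount, and the `window` and `step` values are all assumptions made for illustration. Calling it once with coding-region frequencies and once with non-coding frequencies as the background would give rough analogues of D1 and D2.

```python
from collections import Counter
from itertools import product

K = 6  # hexamer length

def hexamer_freqs(seqs, pseudocount=1.0):
    """Relative frequency of each of the 4**6 hexamers in the training sequences.

    The pseudocount (an assumption) keeps unseen hexamers from having zero frequency.
    Hexamers containing characters other than A, C, G, T are ignored.
    """
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - K + 1):
            counts[s[i:i + K]] += 1
    all_hex = ["".join(p) for p in product("ACGT", repeat=K)]
    total = sum(counts[h] for h in all_hex) + pseudocount * len(all_hex)
    return {h: (counts[h] + pseudocount) / total for h in all_hex}

def window_score(window, f_prom, f_back):
    """Stand-in discriminant: mean of f_prom / (f_prom + f_back) over the window.

    Values above 0.5 favour the promoter class; 0.5 is returned when the
    window contains no scorable hexamer.
    """
    w = window.upper()
    vals = []
    for i in range(len(w) - K + 1):
        h = w[i:i + K]
        if h in f_prom:
            vals.append(f_prom[h] / (f_prom[h] + f_back[h]))
    return sum(vals) / len(vals) if vals else 0.5

def scan(seq, f_prom, f_back, window=100, step=10):
    """Score overlapping windows; returns a list of (start, score) pairs."""
    last_start = max(len(seq) - window, 0)
    return [(start, window_score(seq[start:start + window], f_prom, f_back))
            for start in range(0, last_start + 1, step)]

def peak_window(scores):
    """Peak search: the single best-scoring window, with no absolute threshold."""
    return max(scores, key=lambda t: t[1])

def windows_above(scores, threshold):
    """Threshold search: every window whose score reaches a fixed cutoff."""
    return [t for t in scores if t[1] >= threshold]
```

In this sketch, `peak_window` always reports exactly one candidate per sequence, whereas `windows_above` may report none or many depending on the cutoff, which mirrors the distinction the abstract draws between a peak search and a fixed scoring threshold.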