Schbath S, Prum B, de Turckheim E
INRA, Département de Biométrie et Intelligence Artificielle, Jouy-en-Josas, France.
J Comput Biol. 1995 Fall;2(3):417-37. doi: 10.1089/cmb.1995.2.417.
Identifying exceptional motifs is often used for extracting information from long DNA sequences. The two difficulties of the method are the choice of the model that defines the expected frequencies of words and the approximation of the variance of the difference T(W) between the number of occurrences of a word W and its estimate. We consider here different Markov chain models, with either stationary or periodic transition probabilities. We estimate the variance of the difference T(W) by the conditional variance of the number of occurrences of W given the oligonucleotide counts that define the model. Two applications show how to use asymptotically standard normal statistics associated with the counts to describe a given sequence in terms of its outlying words. Sequences of Escherichia coli and of Bacillus subtilis are compared with respect to their exceptional tri- and tetranucleotides. For both bacteria, exceptional 3-words are mainly found in the coding frame. E. coli palindrome counts are analyzed in different models, showing that many overabundant words are one-letter mutations of avoided palindromes.
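The abstract describes scoring a word W by comparing its observed count with the count expected under a Markov model, then standardizing by an estimated variance. The abstract itself gives no formulas, so the following is only a minimal sketch under stated assumptions: it uses a stationary model of maximal order (k-2 for a k-letter word), for which the expected count is commonly estimated as N(prefix)·N(suffix)/N(core), together with a widely used large-sequence approximation of the variance. The function names `count_words` and `word_zscore` are hypothetical, and the periodic (coding-frame) models mentioned in the abstract are not covered.

```python
from collections import Counter
from math import sqrt


def count_words(seq, k):
    """Count overlapping occurrences of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def word_zscore(seq, word):
    """Approximate z-score of `word` under a stationary Markov model of
    order k-2 (an assumption of this sketch, not taken from the abstract).

    expected = N(prefix) * N(suffix) / N(core)
    variance ~ expected * (N(core) - N(prefix)) * (N(core) - N(suffix)) / N(core)^2

    Edge effects at the ends of the sequence are ignored, so the result
    is only meaningful for sequences much longer than the word.
    Returns None when the score is undefined.
    """
    k = len(word)
    if k < 3:
        raise ValueError("word must have at least 3 letters")

    observed = count_words(seq, k)[word]
    shorter = count_words(seq, k - 1)
    n_prefix = shorter[word[:-1]]   # count of w1..w(k-1)
    n_suffix = shorter[word[1:]]    # count of w2..wk
    n_core = count_words(seq, k - 2)[word[1:-1]]  # count of w2..w(k-1)

    if n_core == 0:
        return None

    expected = n_prefix * n_suffix / n_core
    variance = expected * (n_core - n_prefix) * (n_core - n_suffix) / n_core ** 2
    if variance <= 0:
        return None

    return (observed - expected) / sqrt(variance)


if __name__ == "__main__":
    # Toy input only; real use would involve a long genomic sequence.
    print(word_zscore("ACGTACGTCCAT", "ACG"))
```

Under the asymptotic normality the abstract refers to, a large positive score would flag an overabundant word and a large negative score an avoided one. On a toy sequence this short the normal approximation does not apply; the call above only shows the mechanics.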