Claverie J M, Bougueleret L
Nucleic Acids Res. 1986 Jan 10;14(1):179-96. doi: 10.1093/nar/14.1.179.
Nucleotide or amino-acid sequences are interpreted as successions of words of length k (k-tuples), the frequencies of which vary widely between different statistical populations of genes or proteins. After k-tuple reference tables have been built from coherent subsets or entire data banks, the local information content profile of an individual sequence is drawn. Anomalous regions (peaks or depressions) of such a profile can lead to the discovery and identification of specific sequence patterns. On the same principle, the simultaneous use of two reference statistical populations, together with the computation of an index combining the two information profiles, leads to a general and powerful method of discriminant analysis. The identification of a "signal" associated with gene conversion, intron/exon discrimination, and the location of function-specific patterns in proteins are given as examples of successful applications of this heuristic informational approach.
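The abstract gives no formulas, so the following is only a minimal sketch of how the three steps (reference table, local profile, two-population index) might fit together. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the use of surprisal (-log2 p) as the "information content" of a k-tuple, the add-one pseudocount, the sliding-window average, and the choice of a simple difference as the combined index. Sequences are assumed to contain only symbols of the given alphabet.

```python
import math
from collections import Counter


def ktuple_table(sequences, k, alphabet="ACGT", pseudocount=1.0):
    """Build a k-tuple reference table from a population of sequences.

    Returns a function mapping a k-tuple to its estimated probability.
    The pseudocount (an assumption of this sketch) keeps unseen words
    from getting probability zero.
    """
    counts = Counter()
    for seq in sequences:
        # Count overlapping words of length k.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    n_words = len(alphabet) ** k
    total = sum(counts.values()) + pseudocount * n_words

    def prob(word):
        return (counts.get(word, 0) + pseudocount) / total

    return prob


def information_profile(seq, prob, k, window=5):
    """Mean surprisal (-log2 p) of the k-tuples in each sliding window.

    High values mark regions whose words are rare in the reference
    population; low values mark regions whose words are common.
    """
    surprisal = [-math.log2(prob(seq[i:i + k]))
                 for i in range(len(seq) - k + 1)]
    return [sum(surprisal[i:i + window]) / window
            for i in range(len(surprisal) - window + 1)]


def discriminant_profile(seq, prob_a, prob_b, k, window=5):
    """Combine two information profiles into one index per window.

    The index is the mean log2 ratio p_A / p_B over the window:
    positive where the region resembles population A more than B,
    negative where it resembles B more than A.
    """
    profile_a = information_profile(seq, prob_a, k, window)
    profile_b = information_profile(seq, prob_b, k, window)
    return [b - a for a, b in zip(profile_a, profile_b)]


if __name__ == "__main__":
    # Toy populations, invented purely to exercise the functions.
    pop_a = ["ATGGCCATGGCCATGGCC"]
    pop_b = ["TTTTATTTTATTTTATTTTA"]
    table_a = ktuple_table(pop_a, k=3)
    table_b = ktuple_table(pop_b, k=3)
    query = "ATGGCCATGGCCTTTTATTTTATTTTA"
    for pos, value in enumerate(
            discriminant_profile(query, table_a, table_b, k=3, window=4)):
        print(pos, round(value, 2))
```

In this toy run the index is expected to change sign along the query, where the sequence switches from words typical of the first population to words typical of the second. With real data, the choice of k, the window width, and the composition of the reference populations would all need to be tuned to the question being asked.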