Radomski Jan P, Slonimski Piotr P
Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University, Pawińskiego 5A, Bldg. D, 02106 Warsaw, Poland.
C R Biol. 2007 Jan;330(1):33-48. doi: 10.1016/j.crvi.2006.11.001. Epub 2006 Dec 1.
A method is proposed to represent and analyze complete genome sequences (52 species of prokaryotes and eukaryotes), based on n-gram frequencies of amino acid pairs (bigrams) separated by a given number of other residues. For each species analyzed, it allows us to construct profiles of over-abundant and over-deficient occurrences, summarizing amino acid bigram frequencies over the entire genome. The method deals efficiently with the sparseness of statistical representations of individual sequences, and describes every gene sequence in the same way, independently of its length and of the genome size. The frequency of over-abundant and over-deficient bigram occurrences shows a singular periodicity of around 3.5 peptide bonds, suggesting a relation to the alpha-helical secondary structure.
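To make the notion of a gapped bigram concrete, the following is a minimal Python sketch of counting amino acid pairs separated by a given number of other residues. It is an illustration only, not the authors' implementation: the function names `gapped_bigram_counts` and `observed_over_expected` are hypothetical, and the observed-to-expected ratio under independent residue frequencies is an assumed stand-in for whatever statistic the paper uses to call a bigram over-abundant or over-deficient.

```python
from collections import Counter


def gapped_bigram_counts(sequence, gap):
    """Count ordered residue pairs (a, b) with exactly `gap` residues between them.

    gap=0 gives adjacent pairs; gap=3 pairs positions i and i+4.
    (Hypothetical helper, not taken from the paper.)
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    step = gap + 1
    # zip stops at the shorter input, so pairs never run past the sequence end.
    return Counter(zip(sequence, sequence[step:]))


def observed_over_expected(sequence, gap):
    """Ratio of observed pair frequency to the product of single-residue frequencies.

    ASSUMPTION: independence of residues is used as the null model here;
    the paper's actual criterion may differ.
    """
    counts = gapped_bigram_counts(sequence, gap)
    total = sum(counts.values())
    if total == 0:
        return {}
    residue_freq = Counter(sequence)
    n = len(sequence)
    return {
        (a, b): (c / total) / ((residue_freq[a] / n) * (residue_freq[b] / n))
        for (a, b), c in counts.items()
    }
```

Under these assumptions, ratios well above 1 would mark candidate over-abundant pairs and ratios well below 1 candidate over-deficient ones. Repeating the calculation for a range of `gap` values is one way a periodicity in the separation could be looked for.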