Ferreira L M, Sáfadi T, Lima R R
Departamento de Estatística, Universidade Federal de Lavras, Lavras, MG, Brasil.
Genet Mol Res. 2017 Sep 21;16(3):gmr-16-03-gmr.16039758. doi: 10.4238/gmr16039758.
Wavelets have become increasingly popular in bioinformatics because of their capacity for multiresolution analysis and space-frequency localization; the latter property arises from a moving window that runs through the analyzed space. As a result, they are better able to capture hidden components of biological data and provide an efficient link between biological systems and the mathematical objects used to describe them. Decomposing signals or sequences at different levels of resolution yields distinct characteristics at each level. The energy (variance) obtained at each level provides a new set of information that can be used to search for similarities between sequences. We show that the behavior of a GC-content sequence can be succinctly described in terms of the non-decimated wavelet transform, and we indicate how this characterization, at a very efficient level of detail, can be used to improve the clustering of similar strains of the Mycobacterium tuberculosis genome. The clustering analysis based on the energy obtained at each level of the analyzed sequences was essential for verifying the dissimilarity of the sequences.
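The per-level energy described above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: a Haar filter, periodic boundaries, a window of 100 bases for the GC content, and randomly generated toy sequences in place of real genomes. The function names `gc_content` and `haar_ndwt_energy` are hypothetical and are not taken from the paper.

```python
import numpy as np


def gc_content(seq, window=100):
    """Fraction of G/C bases in consecutive non-overlapping windows.

    The window size is an assumption; trailing bases that do not fill
    a whole window are dropped.
    """
    is_gc = np.array([base in "GCgc" for base in seq], dtype=float)
    n_windows = len(is_gc) // window
    return is_gc[: n_windows * window].reshape(n_windows, window).mean(axis=1)


def haar_ndwt_energy(signal, levels):
    """Variance of the detail coefficients at each level of a Haar
    non-decimated wavelet transform (a trous scheme, periodic boundary).

    No downsampling is done: at level j the filter taps are 2**j samples
    apart, so every level keeps the full signal length. The signal
    should contain at least 2**levels samples.
    """
    approx = np.asarray(signal, dtype=float)
    energies = []
    for j in range(levels):
        shifted = np.roll(approx, -(2 ** j))  # sample 2**j positions ahead
        detail = (approx - shifted) / 2.0     # high-pass (detail) output
        approx = (approx + shifted) / 2.0     # low-pass (smooth) output
        energies.append(detail.var())
    return np.array(energies)


# Toy usage: two random sequences stand in for real genomes (assumption).
rng = np.random.default_rng(0)
seq_a = "".join(rng.choice(list("ACGT"), size=6400))
seq_b = "".join(rng.choice(list("ACGT"), size=6400, p=[0.2, 0.3, 0.3, 0.2]))

energy_a = haar_ndwt_energy(gc_content(seq_a), levels=4)
energy_b = haar_ndwt_energy(gc_content(seq_b), levels=4)

# Euclidean distance between the energy vectors; a matrix of such
# distances could be passed to any standard clustering routine.
distance = np.linalg.norm(energy_a - energy_b)
print(energy_a, energy_b, distance)
```

Each sequence is reduced to one energy value per resolution level, so sequences of different lengths become short vectors of equal size that can be compared directly. The distance measure and clustering method used in the study are not given in the abstract; the Euclidean distance here is only a placeholder.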