Devillers Hugo, Schbath Sophie
INRA, UR1077, Mathématique, Informatique, et Génome, Jouy-en-Josas, France.
J Comput Biol. 2012 Jan;19(1):1-12. doi: 10.1089/cmb.2011.0070. Epub 2011 Dec 9.
Word matches are widely used to compare genomic sequences. Complete genome alignment methods often use matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are also based on word matches. Among the matches retrieved when two genomic sequences are compared, some may be spurious matches (SMs), that is, matches that arise by chance rather than from homology. The number of SMs depends on the minimal match length (ℓ) that must be set in the algorithm used to retrieve them. If ℓ is too small, many matches are recovered but most of them are SMs. Conversely, if ℓ is too large, fewer matches are retrieved and many shorter but significant matches are missed. To date, ℓ has mostly been chosen from empirical threshold values rather than by robust statistical methods. To overcome this problem, we propose a statistical approach that uses a mixture model of geometric distributions to characterize the distribution of the lengths of matches obtained from the comparison of two genomic sequences.
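To make the idea of a geometric mixture on match lengths concrete, the following Python sketch fits a two-component mixture by expectation-maximization (EM). It is an illustration only, not the authors' implementation: the abstract does not state the number of components or the fitting procedure, so the choice of two components (one short-tailed "spurious" component, one long-tailed "significant" component), the use of EM, and all names (`fit_geometric_mixture`, `posterior_spurious`, `simulate_lengths`, `min_len`) are assumptions made for this example. Each component models the excess length k = length - ℓ with P(k) = (1 - q) q^k, where q is the probability that a match extends by one more position.

```python
import math
import random
from collections import Counter

EPS = 1e-9  # keeps probabilities strictly inside (0, 1) so logs stay finite


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def _log_terms(k, w, q0, q1):
    """Log of each weighted component probability for an excess length k."""
    a = math.log(w) + math.log1p(-q0) + k * math.log(q0)
    b = math.log1p(-w) + math.log1p(-q1) + k * math.log(q1)
    return a, b


def posterior_spurious(length, min_len, w, q0, q1):
    """Posterior probability that a match of this length comes from component 0."""
    a, b = _log_terms(length - min_len, w, q0, q1)
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)


def fit_geometric_mixture(lengths, min_len, n_iter=500, tol=1e-10):
    """EM for a two-component geometric mixture (an assumed model, see lead-in).

    Returns (w, q0, q1) with q0 <= q1, where w is the weight of component 0.
    """
    # Work on a histogram of excess lengths so each EM pass is cheap.
    counts = Counter(x - min_len for x in lengths)
    if not counts or min(counts) < 0:
        raise ValueError("need at least one length, and all lengths >= min_len")
    n = sum(counts.values())
    mean_k = sum(k * c for k, c in counts.items()) / n
    # Crude starting point (assumption): split around the single-geometric fit.
    q_all = mean_k / (1.0 + mean_k)
    w, q0, q1 = 0.5, _clamp(0.5 * q_all), _clamp(0.5 * (1.0 + q_all))
    prev = -math.inf
    for _ in range(n_iter):
        sr = srk = s1k = loglik = 0.0
        for k, c in counts.items():
            # E-step: responsibility of component 0 for excess length k.
            a, b = _log_terms(k, w, q0, q1)
            m = max(a, b)
            ea, eb = math.exp(a - m), math.exp(b - m)
            r = ea / (ea + eb)
            loglik += c * (m + math.log(ea + eb))
            sr += c * r
            srk += c * r * k
            s1k += c * (1.0 - r) * k
        # M-step: weighted maximum-likelihood updates, q = sum(k) / sum(k + 1).
        w = _clamp(sr / n)
        q0 = _clamp(srk / max(srk + sr, 1e-300))
        q1 = _clamp(s1k / max(s1k + (n - sr), 1e-300))
        if abs(loglik - prev) <= tol * abs(loglik):
            break
        prev = loglik
    if q0 > q1:  # report the short-tailed component first
        w, q0, q1 = 1.0 - w, q1, q0
    return w, q0, q1


def simulate_lengths(n, min_len, w, q0, q1, seed=0):
    """Draw synthetic match lengths from the assumed mixture (for testing only)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        q = q0 if rng.random() < w else q1
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
        out.append(min_len + int(math.log(u) / math.log(q)))
    return out
```

One possible use, on synthetic data rather than real genome comparisons: call `simulate_lengths(20000, 12, 0.8, 0.25, 0.9)`, pass the result to `fit_geometric_mixture(data, 12)`, and then evaluate `posterior_spurious(length, 12, w, q0, q1)` over a range of lengths. Under these assumptions, the length at which that posterior falls below a chosen tolerance is one candidate way to pick ℓ; the criterion actually used in the paper may differ.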