Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.
Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA.
BMC Genomics. 2017 Oct 3;18(Suppl 6):732. doi: 10.1186/s12864-017-4020-z.
Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ -statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies.
We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r and r , respectively. We show through both simulations and theoretical studies that the optimal k= max(r ,r )+1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains.
Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences.
由于新测序技术产生了大量的序列数据,基于字模式(gram,k-tuple)的无比对序列比较已经成为一个活跃的研究课题。基因组序列通常采用马尔可夫链建模,并且已经提出似然比检验或相应的近似 χ -统计量来比较两个序列。然而,在这些研究中,如何选择最佳的字长 k 尚不清楚。
我们通过最大化检测两个序列之间差异的统计功效,开发了一种选择 k 的最佳策略。令两个序列的马尔可夫链的阶分别为 r 和 r 。我们通过模拟和理论研究表明,对于长序列和下一代测序(NGS)读取数据,最优的 k= max(r,r )+1。马尔可夫链的阶可能未知,已经开发了几种方法来基于长序列和 NGS 读取来估计马尔可夫链的阶。我们研究了在使用估计阶时统计功效的损失。结果表明,对于一些马尔可夫链阶的估计器,功率损失最小。
我们的研究为比较马尔可夫序列时选择最佳字长提供了指导。