Suppr超能文献

使用 χ ²统计量比较两个马尔可夫序列时的最佳字长选择。

Optimal choice of word length when comparing two Markov sequences using a χ -statistic.

机构信息

Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.

Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA.

出版信息

BMC Genomics. 2017 Oct 3;18(Suppl 6):732. doi: 10.1186/s12864-017-4020-z.

Abstract

BACKGROUND

Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ -statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies.

RESULTS

We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r and r , respectively. We show through both simulations and theoretical studies that the optimal k= max(r ,r )+1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains.

CONCLUSION

Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences.

摘要

背景

由于新测序技术产生了大量的序列数据,基于字模式(gram,k-tuple)的无比对序列比较已经成为一个活跃的研究课题。基因组序列通常采用马尔可夫链建模,并且已经提出似然比检验或相应的近似 χ -统计量来比较两个序列。然而,在这些研究中,如何选择最佳的字长 k 尚不清楚。

结果

我们通过最大化检测两个序列之间差异的统计功效,开发了一种选择 k 的最佳策略。令两个序列的马尔可夫链的阶分别为 r 和 r 。我们通过模拟和理论研究表明,对于长序列和下一代测序(NGS)读取数据,最优的 k= max(r,r )+1。马尔可夫链的阶可能未知,已经开发了几种方法来基于长序列和 NGS 读取来估计马尔可夫链的阶。我们研究了在使用估计阶时统计功效的损失。结果表明,对于一些马尔可夫链阶的估计器,功率损失最小。

结论

我们的研究为比较马尔可夫序列时选择最佳字长提供了指导。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ffa/5629589/872cbb04cf7e/12864_2017_4020_Fig1_HTML.jpg

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验