Optimal choice of word length when comparing two Markov sequences using a χ²-statistic.

Affiliations

Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.

Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA.

Publication information

BMC Genomics. 2017 Oct 3;18(Suppl 6):732. doi: 10.1186/s12864-017-4020-z.

DOI:10.1186/s12864-017-4020-z

PMID:28984181

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5629589/

Abstract

BACKGROUND

Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains, and the likelihood ratio test or the corresponding approximate χ²-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies.
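As an illustration of the kind of word-count comparison described above, here is a minimal Python sketch. It counts overlapping k-tuples in two sequences and computes a textbook Pearson χ² statistic for homogeneity of the order-(k-1) transition counts. The function names `count_words` and `chi2_homogeneity` are hypothetical, and this generic form is an assumption for illustration; it is not claimed to be the exact statistic studied in the paper.

```python
from collections import Counter


def count_words(seq, k):
    """Count overlapping words (k-tuples) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def chi2_homogeneity(seq_a, seq_b, k, alphabet="ACGT"):
    """Pearson chi-square statistic comparing the k-tuple counts of two sequences.

    Each k-tuple is split into a (k-1)-letter context and a final letter.
    Under the null hypothesis both sequences share the same transition
    probabilities, which are estimated from the pooled counts.
    """
    counts = [count_words(s, k) for s in (seq_a, seq_b)]
    contexts = {w[:-1] for c in counts for w in c}
    stat = 0.0
    for ctx in contexts:
        # Context totals are derived from the k-tuple counts so that
        # observed and expected counts sum to the same value per context.
        totals = [sum(c[ctx + x] for x in alphabet) for c in counts]
        pooled_total = sum(totals)
        for x in alphabet:
            pooled = counts[0][ctx + x] + counts[1][ctx + x]
            if pooled == 0:
                continue  # word unseen in both sequences contributes nothing
            p_hat = pooled / pooled_total  # pooled estimate of P(x | ctx)
            for c, n in zip(counts, totals):
                expected = n * p_hat
                if expected > 0:
                    stat += (c[ctx + x] - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    print(chi2_homogeneity("ACGTTGCAACGTAGCT", "AAAACCCCGGGGTTTT", k=2))
```

A larger value of the statistic indicates stronger evidence that the two sequences follow different transition probabilities at the chosen word length.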

RESULTS

We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r₁ and r₂, respectively. We show through both simulations and theoretical studies that the optimal word length is k = max(r₁, r₂) + 1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown, and several methods have been developed to estimate them from both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains.
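The rule k = max(r₁, r₂) + 1 can be sketched in a few lines of Python once the orders are known or estimated. The example below assumes a BIC-based order estimator for a single long sequence, which is one common choice and not necessarily one of the estimators evaluated in the paper. The function names, the `max_order` cap, and the four-letter alphabet are all illustrative assumptions.

```python
import math
from collections import Counter


def log_likelihood(seq, order, start):
    """Maximized log-likelihood of an order-`order` Markov chain on seq[start:]."""
    pairs = Counter((seq[i - order:i], seq[i]) for i in range(start, len(seq)))
    context_totals = Counter()
    for (ctx, _), n in pairs.items():
        context_totals[ctx] += n
    return sum(n * math.log(n / context_totals[ctx])
               for (ctx, _), n in pairs.items())


def estimate_order_bic(seq, max_order=5, alphabet_size=4):
    """Return the order in 0..max_order with the smallest BIC (assumed estimator)."""
    n_obs = len(seq) - max_order  # same positions scored for every order
    best_order, best_bic = 0, float("inf")
    for r in range(max_order + 1):
        n_params = alphabet_size ** r * (alphabet_size - 1)
        bic = -2.0 * log_likelihood(seq, r, max_order) + n_params * math.log(n_obs)
        if bic < best_bic:  # strict inequality keeps the lowest order on ties
            best_order, best_bic = r, bic
    return best_order


def optimal_word_length(r1, r2):
    """Word length k = max(r1, r2) + 1, as stated in the abstract."""
    return max(r1, r2) + 1


if __name__ == "__main__":
    seq_a = "ACGT" * 200
    seq_b = "AACCGGTT" * 100
    r1, r2 = estimate_order_bic(seq_a), estimate_order_bic(seq_b)
    print(r1, r2, optimal_word_length(r1, r2))
```

Every candidate order is scored on the same positions (from `max_order` onward) so that the likelihoods are directly comparable. NGS read data would need an estimator adapted to short reads, which this sketch does not cover.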

CONCLUSION

Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ffa/5629589/872cbb04cf7e/12864_2017_4020_Fig1_HTML.jpg

Similar articles

Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.

Bioinformatics. 2016 Apr 1;32(7):993-1000. doi: 10.1093/bioinformatics/btv395. Epub 2015 Jun 30.

New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing.

Brief Bioinform. 2014 May;15(3):343-53. doi: 10.1093/bib/bbt067. Epub 2013 Sep 23.

Hidden Markov Models in Bioinformatics: SNV Inference from Next Generation Sequence.

Methods Mol Biol. 2017;1552:123-133. doi: 10.1007/978-1-4939-6753-7_9.

Estimation of evolutionary parameters using short, random and partial sequences from mixed samples of anonymous individuals.

BMC Bioinformatics. 2015 Nov 4;16:357. doi: 10.1186/s12859-015-0810-y.

Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns.

BMC Bioinformatics. 2014;15 Suppl 9(Suppl 9):S1. doi: 10.1186/1471-2105-15-S9-S1. Epub 2014 Sep 10.

A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis.

Hum Genomics. 2016 Jul 25;10 Suppl 2(Suppl 2):20. doi: 10.1186/s40246-016-0068-0.

A survey and evaluations of histogram-based statistics in alignment-free sequence comparison.

Brief Bioinform. 2019 Jul 19;20(4):1222-1237. doi: 10.1093/bib/bbx161.

Alignment-free sequence comparison based on next-generation sequencing reads.

J Comput Biol. 2013 Feb;20(2):64-79. doi: 10.1089/cmb.2012.0228.

Comparison of metagenomic samples using sequence signatures.

BMC Genomics. 2012 Dec 27;13:730. doi: 10.1186/1471-2164-13-730.

Cited by

K-mer-based Approaches to Bridging Pangenomics and Population Genetics.

Mol Biol Evol. 2025 Mar 5;42(3). doi: 10.1093/molbev/msaf047.

An alignment-free method for detection of missing regions for phylogenetic analysis.

Heliyon. 2024 Jun 4;10(11):e32227. doi: 10.1016/j.heliyon.2024.e32227. eCollection 2024 Jun 15.

KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis.

Front Bioeng Biotechnol. 2020 Sep 23;8:556413. doi: 10.3389/fbioe.2020.556413. eCollection 2020.

Alignment-Free Sequence Analysis and Applications.

Annu Rev Biomed Data Sci. 2018 Jul;1:93-114. doi: 10.1146/annurev-biodatasci-080917-013431. Epub 2018 Apr 25.

Afann: bias adjustment for alignment-free sequence comparison based on sequencing data using neural network regression.

Genome Biol. 2019 Dec 4;20(1):266. doi: 10.1186/s13059-019-1872-3.

The International Conference on Intelligent Biology and Medicine (ICIBM) 2016: summary and innovation in genomics.

BMC Genomics. 2017 Oct 3;18(Suppl 6):703. doi: 10.1186/s12864-017-4018-6.
