Alignment-free sequence comparison (I): statistics and power.

Authors

Gesine Reinert, David Chew, Fengzhu Sun, Michael S. Waterman

Affiliation

Department of Statistics, University of Oxford, Oxford OX1 3TG, UK.

Publication details

J Comput Biol. 2009 Dec;16(12):1615-34. doi: 10.1089/cmb.2009.0198.

DOI:10.1089/cmb.2009.0198

PMID:20001252

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2818754/

Abstract

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D(2) statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D(2) statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D(2) word count statistic, which we call D(2)(S) and D(2)(*). For D(2)(S), which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D(2)(*), outperforms D(2)(S) in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D(2)(*), we cannot provide a closed form for power calculations.
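
The abstract names the three statistics but gives no formulas. The sketch below is a minimal illustration, assuming the definitions commonly quoted for this family of statistics: D(2) as the sum over all k-tuples w of the product of the two counts, and D(2)(S) and D(2)(*) as sums of products of counts centered by their expected values under an i.i.d. letter model, normalized by sqrt(Xc^2 + Yc^2) and by sqrt(n*m)*p_w respectively. The function names, the `probs` argument, and the zero-denominator handling are hypothetical choices made for illustration, not taken from the article.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count overlapping k-tuples (k-mers) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(a, b, k, probs):
    """Return (D2, D2S, D2star) for sequences a and b.

    probs maps each letter to its assumed background probability
    (an i.i.d. letter model; this is an assumption of the sketch).
    """
    x, y = kmer_counts(a, k), kmer_counts(b, k)
    n_bar = len(a) - k + 1  # number of k-tuple positions in a
    m_bar = len(b) - k + 1  # number of k-tuple positions in b
    d2 = d2s = d2star = 0.0
    for letters in product(sorted(probs), repeat=k):
        w = "".join(letters)
        p_w = prod(probs[c] for c in letters)  # word probability
        xw, yw = x[w], y[w]
        xc = xw - n_bar * p_w  # centered count in a
        yc = yw - m_bar * p_w  # centered count in b
        d2 += xw * yw
        denom = sqrt(xc * xc + yc * yc)
        if denom > 0:  # skip words whose centered counts are both zero
            d2s += xc * yc / denom
        d2star += xc * yc / (sqrt(n_bar * m_bar) * p_w)
    return d2, d2s, d2star


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(d2_statistics("ACGTACGTTGCA", "ACGTTTGACGTA", 2, uniform))
```

The loop enumerates all 4^k words, which is only practical for small k; a sparse variant would need to account separately for words absent from both sequences, since their centered counts are still nonzero.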

Similar articles

Alignment-free sequence comparison (I): statistics and power.

J Comput Biol. 2009 Dec;16(12):1615-34. doi: 10.1089/cmb.2009.0198.

Alignment-free sequence comparison (II): theoretical power of comparison statistics.

J Comput Biol. 2010 Nov;17(11):1467-90. doi: 10.1089/cmb.2010.0056. Epub 2010 Oct 25.

New powerful statistics for alignment-free sequence comparison under a pattern transfer model.

J Theor Biol. 2011 Sep 7;284(1):106-16. doi: 10.1016/j.jtbi.2011.06.020. Epub 2011 Jun 25.

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences.

BMC Bioinformatics. 2006 Dec 18;7 Suppl 5(Suppl 5):S21. doi: 10.1186/1471-2105-7-S5-S21.

A survey and evaluations of histogram-based statistics in alignment-free sequence comparison.

Brief Bioinform. 2019 Jul 19;20(4):1222-1237. doi: 10.1093/bib/bbx161.

Alignment-free sequence comparison for biologically realistic sequences of moderate length.

Stat Appl Genet Mol Biol. 2012;11(1):Article 3.

The distribution of word matches between Markovian sequences with periodic boundary conditions.

J Comput Biol. 2014 Jan;21(1):41-63. doi: 10.1089/cmb.2012.0277. Epub 2013 Oct 26.

tuple_plot: fast pairwise nucleotide sequence comparison with noise suppression.

Bioinformatics. 2006 Aug 1;22(15):1917-8. doi: 10.1093/bioinformatics/btl277. Epub 2006 Jun 9.

Evolution of biological sequences implies an extreme value distribution of type I for both global and local pairwise alignment scores.

BMC Bioinformatics. 2008 Aug 7;9:332. doi: 10.1186/1471-2105-9-332.

A pairwise alignment algorithm which favors clusters of blocks.

J Comput Biol. 2005;12(1):33-47. doi: 10.1089/cmb.2005.12.33.

Cited by

Energy entropy vector: a novel approach for efficient microbial genomic sequence analysis and classification.

Brief Bioinform. 2025 Sep 6;26(5). doi: 10.1093/bib/bbaf459.

An alignment- and reference-free strategy using k-mer present pattern for population genomic analyses.

Mycology. 2024 Jun 5;16(1):309-323. doi: 10.1080/21501203.2024.2358868. eCollection 2025.

Motif distribution in genomes gives insights into gene clustering and co-regulation.

Nucleic Acids Res. 2025 Jan 7;53(1). doi: 10.1093/nar/gkae1178.

TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing.

Genome Biol. 2024 Nov 4;25(1):285. doi: 10.1186/s13059-024-03423-3.

AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data.

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad101. Epub 2023 Dec 13.

HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing.

Genome Biol. 2023 Oct 5;24(1):222. doi: 10.1186/s13059-023-03053-1.

An ensemble method for prediction of phage-based therapy against bacterial infections.

Front Microbiol. 2023 Mar 23;14:1148579. doi: 10.3389/fmicb.2023.1148579. eCollection 2023.

Reference-free phylogeny from sequencing data.

BioData Min. 2023 Mar 27;16(1):13. doi: 10.1186/s13040-023-00329-x.

Bioinformatics approaches for unveiling virus-host interactions.

Comput Struct Biotechnol J. 2023;21:1774-1784. doi: 10.1016/j.csbj.2023.02.044. Epub 2023 Feb 27.

Feature extraction based on microstate sequences for EEG-based emotion recognition.

Front Psychol. 2022 Dec 23;13:1065196. doi: 10.3389/fpsyg.2022.1065196. eCollection 2022.

References

A statistical method for alignment-free comparison of regulatory sequences.

Bioinformatics. 2007 Jul 1;23(13):i249-55. doi: 10.1093/bioinformatics/btm211.

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences.

BMC Bioinformatics. 2006 Dec 18;7 Suppl 5(Suppl 5):S21. doi: 10.1186/1471-2105-7-S5-S21.

Distributional regimes for the number of k-word matches between two random sequences.

Proc Natl Acad Sci U S A. 2002 Oct 29;99(22):13980-9. doi: 10.1073/pnas.202468099. Epub 2002 Oct 8.
