Separating significant matches from spurious matches in DNA sequences.

Authors

Devillers Hugo, Schbath Sophie

Affiliation

INRA, UR1077, Mathématique, Informatique, et Génome, Jouy-en-Josas, France.

Publication

J Comput Biol. 2012 Jan;19(1):1-12. doi: 10.1089/cmb.2011.0070. Epub 2011 Dec 9.

DOI: 10.1089/cmb.2011.0070

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3244807/

Abstract

Word matches are widely used to compare genomic sequences. Complete genome alignment methods often use matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are based on word matches. Some of the matches retrieved when comparing two genomic sequences may be spurious matches (SMs), that is, matches that arise by chance rather than from homology. The number of SMs depends on the minimal match length (ℓ) set in the algorithm used to retrieve them. If ℓ is too small, many matches are recovered but most of them are SMs. Conversely, if ℓ is too large, fewer matches are retrieved and many shorter significant matches are missed. To date, ℓ has mostly been chosen from empirical threshold values rather than by robust statistical methods. To overcome this problem, we propose a statistical approach that uses a mixture model of geometric distributions to characterize the distribution of match lengths obtained from the comparison of two genomic sequences.
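
To make the idea of a geometric mixture over match lengths concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not state the number of components, the fitting procedure, or the decision rule, so the sketch assumes two components (one spurious, one significant), fits them with a plain EM loop, and picks the threshold with a posterior cutoff `alpha`. All function and parameter names (`fit_two_geometric_mixture`, `min_significant_length`, `min_len`, `alpha`) are hypothetical, and the data are synthetic.

```python
import math
import random
from collections import Counter


def sample_geometric(p, rng):
    """Draw from a geometric distribution on {1, 2, ...} by inverse CDF."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
    return int(math.log(u) / math.log(1.0 - p)) + 1


def _clamp(p):
    """Keep a probability strictly inside (0, 1) so its logs stay finite."""
    return min(max(p, 1e-9), 1.0 - 1e-9)


def fit_two_geometric_mixture(lengths, min_len=1, n_iter=500, tol=1e-8):
    """EM fit of a two-component geometric mixture (assumed model).

    Lengths are shifted so that min_len maps to 1.
    Returns (w_spur, p_spur, p_sig), where the spurious component is
    taken to be the one with the larger parameter (shorter matches).
    """
    counts = Counter(length - min_len + 1 for length in lengths)
    if not counts or min(counts) < 1:
        raise ValueError("need at least one length, all >= min_len")
    n = sum(counts.values())
    mean = sum(x * c for x, c in counts.items()) / n
    # Crude starting point: one short-length and one long-length component.
    w, p_a, p_b = 0.5, min(0.9, 2.0 / mean), min(0.45, 0.5 / mean)
    prev_ll = -math.inf
    for _ in range(n_iter):
        sum_r = sum_rx = sum_s = sum_sx = ll = 0.0
        for x, c in counts.items():
            # E-step in log space: weighted log-pmf of each component.
            la = math.log(w) + math.log(p_a) + (x - 1) * math.log(1.0 - p_a)
            lb = math.log(1.0 - w) + math.log(p_b) + (x - 1) * math.log(1.0 - p_b)
            m = max(la, lb)
            a, b = math.exp(la - m), math.exp(lb - m)
            r = a / (a + b)  # responsibility of component A for length x
            ll += c * (m + math.log(a + b))
            sum_r += c * r
            sum_rx += c * r * x
            sum_s += c * (1.0 - r)
            sum_sx += c * (1.0 - r) * x
        # M-step: closed-form updates for the weight and both parameters.
        w = _clamp(sum_r / n)
        p_a = _clamp(sum_r / max(sum_rx, 1e-300))
        p_b = _clamp(sum_s / max(sum_sx, 1e-300))
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    if p_a < p_b:  # make component A the spurious (short-length) one
        w, p_a, p_b = 1.0 - w, p_b, p_a
    return w, p_a, p_b


def min_significant_length(w_spur, p_spur, p_sig, min_len=1, alpha=0.05,
                           max_len=10_000):
    """Smallest length whose posterior probability of being spurious is
    below alpha (an assumed decision rule), or None if none is found."""
    cutoff = math.log(alpha / (1.0 - alpha))
    for x in range(1, max_len + 1):
        la = math.log(w_spur) + math.log(p_spur) + (x - 1) * math.log(1.0 - p_spur)
        lb = math.log(1.0 - w_spur) + math.log(p_sig) + (x - 1) * math.log(1.0 - p_sig)
        if la - lb < cutoff:  # same as posterior_spurious < alpha
            return x + min_len - 1
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic lengths: 80% spurious (p = 0.5), 20% significant (p = 0.05).
    data = [sample_geometric(0.5, rng) if rng.random() < 0.8
            else sample_geometric(0.05, rng) for _ in range(20000)]
    w, p_spur, p_sig = fit_two_geometric_mixture(data)
    print(w, p_spur, p_sig, min_significant_length(w, p_spur, p_sig))
```

With real data, `lengths` would be the match lengths reported by a match-finding tool run with a deliberately small minimal length, and `min_len` would be that minimal length. Working with a histogram of lengths rather than the raw list keeps each EM iteration cheap even for millions of matches.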
