A mathematical consideration of the word-composition vector method in comparison of biological sequences.

Authors

Aita Takuyo, Husimi Yuzuru, Nishigaki Koichi

Affiliation

Graduate School of Science and Engineering, Saitama University, 255 Shimo-okubo, Saitama 338-8570, Japan.

Publication information

Biosystems. 2011 Nov;106(2-3):67-75. doi: 10.1016/j.biosystems.2011.06.009. Epub 2011 Jul 1.

DOI:10.1016/j.biosystems.2011.06.009

PMID:21745534

Abstract

To measure the similarity or dissimilarity between two given biological sequences, several papers proposed metrics based on the "word-composition vector". The essence of these metrics is as follows. First, we count the appearance frequencies of all the K-tuple words throughout each of two given sequences. Then, the two given sequences are transformed into their respective word-composition vectors. Next, the distance metrics, for example the angle between the two vectors, are calculated. A significant issue is to determine the optimal word size K. With a mathematical model of mutational events (including substitutions, insertions, deletions and duplications) that occur in sequences, we analyzed how the angle between the composition vectors depends on the mutational events. We also considered the optimal word size (=resolution) from our original approach. Our results were verified by computational experiments using artificially generated sequences, amino acid sequences of hemoglobin and nucleotide sequences of 16S ribosomal RNA.
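The three steps described in the abstract (count K-tuple words, build the composition vectors, compute the angle) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `composition_vector` and `angle_between` are hypothetical, words are assumed to be counted with overlap, and raw counts are used without any normalization or background correction that the paper may apply.

```python
from collections import Counter
from math import acos, sqrt


def composition_vector(seq: str, k: int) -> Counter:
    """Count every overlapping K-tuple word in seq (assumption: overlapping windows)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # A sequence shorter than k yields an empty vector.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def angle_between(u: Counter, v: Counter) -> float:
    """Angle in radians between two sparse word-count vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for an empty composition vector")
    cosine = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return acos(max(-1.0, min(1.0, cosine)))


if __name__ == "__main__":
    a = "ACGTACGTTGCA"  # made-up example sequences
    b = "ACGTTCGTTGCA"
    for k in (1, 2, 3, 4):
        theta = angle_between(composition_vector(a, k), composition_vector(b, k))
        print(k, round(theta, 4))
```

Running the loop over several values of K shows how the angle between the same pair of sequences changes with word size, which is the dependence the paper analyzes when it discusses the optimal K.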
