Integrating overlapping structures and background information of words significantly improves biological sequence comparison.

Affiliation

College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou, People's Republic of China.

Publication information

PLoS One. 2011;6(11):e26779. doi: 10.1371/journal.pone.0026779. Epub 2011 Nov 10.

DOI:10.1371/journal.pone.0026779

PMID:22102867

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC3213098/

Abstract

Word-based models have achieved promising results in sequence comparison. However, how to use two important statistical properties of words in biological sequences, their overlapping structures and their background information, to improve sequence comparison remains an open problem. This paper proposes a new statistical method that integrates the overlapping structures and the background information of words in biological sequences. To assess the effectiveness of this integration for sequence comparison, two sets of evaluation experiments were carried out. The first, performed via receiver operating characteristic (ROC) curve analysis, applies the proposed method to discrimination between functionally related regulatory sequences and unrelated sequences, intron and exon. The second evaluates the proposed method, using the F-measure, on clustering Hepatitis E virus genotypes. The results show that the proposed method, by integrating the overlapping structures and the background information of words, significantly improves biological sequence comparison and outperforms existing models.
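The abstract does not give the statistic itself, so the following is only a minimal sketch of the general idea behind word-based comparison with a background correction, not the method proposed in the paper. It counts k-words with a sliding window (so overlapping occurrences are included), subtracts the count expected under an independent-letter background estimated from each sequence, and takes the Euclidean distance between the two corrected profiles. The function names `kword_counts`, `expected_count`, and `dissimilarity`, the i.i.d. background model, and the Euclidean distance are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kword_counts(seq: str, k: int) -> Counter:
    """Count k-words with a sliding window, so overlapping occurrences are included."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(word: str, seq: str) -> float:
    """Expected count of `word` under an i.i.d. background estimated from `seq` (assumed model)."""
    n = len(seq)
    letter_counts = Counter(seq)
    prob = 1.0
    for ch in word:
        prob *= letter_counts[ch] / n  # Counter yields 0 for letters absent from seq
    return (n - len(word) + 1) * prob


def dissimilarity(a: str, b: str, k: int = 2) -> float:
    """Euclidean distance between background-corrected k-word profiles (illustrative only)."""
    counts_a, counts_b = kword_counts(a, k), kword_counts(b, k)
    alphabet = sorted(set(a) | set(b))
    total = 0.0
    for letters in product(alphabet, repeat=k):
        word = "".join(letters)
        dev_a = counts_a[word] - expected_count(word, a)
        dev_b = counts_b[word] - expected_count(word, b)
        total += (dev_a - dev_b) ** 2
    return sqrt(total)


if __name__ == "__main__":
    print(kword_counts("AAAA", 2))  # the word "AA" overlaps itself in this sequence
    print(dissimilarity("ACGTACGTAC", "ACGTTTGTAC", k=2))
```

An independent-letter background ignores the fact that self-overlapping words such as "AA" have different count variances from non-overlapping words such as "AC"; accounting for that kind of overlap structure is what the abstract describes as the paper's contribution, and it is not modeled in this sketch.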

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bb15/3213098/3857eaefa172/pone.0026779.g001.jpg

Similar articles

Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison.

J Theor Biol. 2011 May 7;276(1):174-80. doi: 10.1016/j.jtbi.2011.02.005. Epub 2011 Feb 18.

A novel statistical measure for sequence comparison on the basis of k-word counts.

J Theor Biol. 2013 Feb 7;318:91-100. doi: 10.1016/j.jtbi.2012.10.035. Epub 2012 Nov 9.

An efficient binomial model-based measure for sequence comparison and its application.

J Biomol Struct Dyn. 2011 Apr;28(5):833-43. doi: 10.1080/07391102.2011.10508611.

Using Markov model to improve word normalization algorithm for biological sequence comparison.

Amino Acids. 2012 May;42(5):1867-77. doi: 10.1007/s00726-011-0906-2. Epub 2011 Apr 20.

Study of LZ-word distribution and its application for sequence comparison.

J Theor Biol. 2013 Nov 7;336:52-60. doi: 10.1016/j.jtbi.2013.07.008. Epub 2013 Jul 19.

Comparison study on k-word statistical measures for protein: from sequence to 'sequence space'.

BMC Bioinformatics. 2008 Sep 23;9:394. doi: 10.1186/1471-2105-9-394.

Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison.

Bioinformatics. 2008 Oct 15;24(20):2296-302. doi: 10.1093/bioinformatics/btn436. Epub 2008 Aug 18.

Using Gaussian model to improve biological sequence comparison.

J Comput Chem. 2010 Jan 30;31(2):351-61. doi: 10.1002/jcc.21322.

A statistical method for alignment-free comparison of regulatory sequences.

Bioinformatics. 2007 Jul 1;23(13):i249-55. doi: 10.1093/bioinformatics/btm211.

Cited by

Association of magnesium intake with sleep duration and sleep quality: findings from the CARDIA study.

Sleep. 2022 Apr 11;45(4). doi: 10.1093/sleep/zsab276.

SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform.

BMC Bioinformatics. 2018 May 2;19(1):165. doi: 10.1186/s12859-018-2155-9.

Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.

Brief Bioinform. 2014 Nov;15(6):890-905. doi: 10.1093/bib/bbt052. Epub 2013 Jul 31.
