College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou, People's Republic of China.
PLoS One. 2011;6(11):e26779. doi: 10.1371/journal.pone.0026779. Epub 2011 Nov 10.
Word-based models have achieved promising results in sequence comparison. However, it remains an open problem how to use two important statistical properties of words in biological sequences, their overlapping structures and their background information, to improve sequence comparison. This paper proposes a new statistical method that integrates the overlapping structures and the background information of the words in biological sequences. To assess the effectiveness of this integration for sequence comparison, the proposed model was tested in two sets of evaluation experiments. The first, assessed by receiver operating characteristic (ROC) curve analysis, applied the proposed method to discriminating between functionally related regulatory sequences and unrelated sequences, and between introns and exons. The second used the F-measure to evaluate the performance of the proposed method in clustering Hepatitis E virus genotypes. The results demonstrate that integrating the overlapping structures and the background information of words significantly improves biological sequence comparison, and that the proposed method outperforms the existing models.
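To make the two ingredients named in the abstract concrete, the following is a minimal sketch of a word-based comparison that counts overlapping occurrences of words and subtracts a background expectation. It is not the statistic proposed in the paper, which the abstract does not specify. The i.i.d. single-letter background model, the cosine score, and every function name here are assumptions chosen for illustration, and the input is assumed to be a non-empty DNA string over A, C, G, T.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

ALPHABET = "ACGT"  # assumption: DNA sequences with no ambiguity codes


def count_words(seq, k):
    """Count every length-k word, including overlapping occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def background_freqs(seq):
    """Estimate single-letter frequencies (i.i.d. background, an assumption)."""
    counts = Counter(seq)
    total = sum(counts[a] for a in ALPHABET)
    return {a: counts[a] / total for a in ALPHABET}


def centered_counts(seq, k):
    """Observed word counts minus their expected counts under the background."""
    observed = count_words(seq, k)
    freqs = background_freqs(seq)
    positions = len(seq) - k + 1  # number of word start positions
    centered = {}
    for letters in product(ALPHABET, repeat=k):
        word = "".join(letters)
        expected = positions * prod(freqs[a] for a in word)
        centered[word] = observed[word] - expected
    return centered


def similarity(seq_a, seq_b, k=2):
    """Cosine of the two centered count vectors (illustrative score only)."""
    x = centered_counts(seq_a, k)
    y = centered_counts(seq_b, k)
    dot = sum(x[w] * y[w] for w in x)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0
```

Calling `similarity(seq_a, seq_b, k=3)` on two DNA strings returns a value between -1 and 1. A sequence whose word counts match the background expectation exactly has an all-zero centered vector, and the function returns 0.0 in that case rather than dividing by zero.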