Lipman D J, Wilbur W J, Smith T F, Waterman M S
Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):215-26. doi: 10.1093/nar/12.1part1.215.
When evaluating sequence similarities among nucleic acids by the usual methods, statistical significance is often found when the biological significance of the similarity is dubious. We demonstrate that the known statistical properties of nucleic acid sequences strongly affect the statistical distribution of similarity values when calculated by standard procedures. We propose a series of models which account for some of these known statistical properties. The utility of the method is demonstrated in evaluating high relative similarity scores in four specific cases in which there is little biological context by which to judge the similarities. In two of the cases we identify the statistical properties which are responsible for the apparent similarity. In the other two cases the statistical significance of the similarity persists even when the known statistical properties of sequences are modelled. For one of these cases biological significance is likely while the other case remains an enigma.
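The general idea of judging a similarity score against a null model can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not specify the models, so the sketch assumes the simplest possible null, a random shuffle that preserves only each sequence's base composition. The function names, the scoring values (match +1, mismatch -1, gap -2), and the trial count are all hypothetical choices made for the example.

```python
import random


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best


def shuffled(seq, rng):
    """Return a random permutation of seq; base composition is preserved exactly."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def empirical_p_value(a, b, trials=200, seed=0):
    """Estimate how often shuffled pairs score at least as high as the real pair."""
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    hits = sum(
        local_alignment_score(shuffled(a, rng), shuffled(b, rng)) >= observed
        for _ in range(trials)
    )
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)
```

Calling `empirical_p_value(seq_a, seq_b)` returns the observed score and the estimated fraction of shuffled pairs that match or exceed it. A null model that preserves more of the sequences' statistics (for example neighbouring-base frequencies, named here only as an illustration) would replace `shuffled`; a similarity that looks significant under the simple shuffle may stop looking significant under the richer model, which is the kind of effect the abstract describes.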