Smith T F, Waterman M S, Burks C
Nucleic Acids Res. 1985 Jan 25;13(2):645-56. doi: 10.1093/nar/13.2.645.
All pairs in a large set of known vertebrate DNA sequences were searched by computer for their most similar segments. Analysis of these data shows that the computed similarity scores scale in proportion to the logarithm of the product of the lengths of the two sequences compared. This distribution is closely related to recent results of Erdős and others on the longest run of heads in coin tossing. A simple rule is derived for determining the statistical significance of the similarity scores and for relating statistical significance to biological significance.
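The coin-tossing analogy in the abstract can be made concrete with a small simulation. The sketch below is not the procedure used in the paper. It assumes fair coins and uniformly random bases, and every function name in it is hypothetical. It relies on the well-known result that the longest run of heads in n tosses grows roughly like log base 1/p of n, and on the analogous scaling with log base 1/p of n·m for the longest exact match between two random sequences of lengths n and m.

```python
import math
import random


def longest_head_run(tosses):
    """Length of the longest run of truthy values (heads) in an iterable."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss else 0
        best = max(best, current)
    return best


def longest_common_run(a, b):
    """Length of the longest contiguous exact match shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the match that ended at the previous position in both.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def log_scale(size, p):
    """Leading-order growth term log_{1/p}(size); lower-order terms are ignored."""
    return math.log(size) / math.log(1.0 / p)


def mean_longest_head_run(n, p=0.5, trials=200, seed=0):
    """Average longest head run over `trials` simulated sequences of n tosses."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_head_run(rng.random() < p for _ in range(n))
    return total / trials


def mean_longest_common_run(n, m, trials=20, seed=0):
    """Average longest exact match between random DNA strings of lengths n, m."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = "".join(rng.choice("ACGT") for _ in range(n))
        b = "".join(rng.choice("ACGT") for _ in range(m))
        total += longest_common_run(a, b)
    return total / trials


if __name__ == "__main__":
    # Coin tossing: compare the simulated mean with log2(n).
    for n in (100, 1000, 10000):
        print(n, mean_longest_head_run(n), log_scale(n, 0.5))
    # Random DNA: compare with log4(n * m), since two uniform bases match
    # with probability 1/4 (an assumption of this toy model).
    for n in (50, 100, 200):
        print(n, mean_longest_common_run(n, n), log_scale(n * n, 0.25))
```

Running the script prints simulated averages next to the logarithmic term so the two can be compared by eye. Exact matching is used here only because it keeps the example short; the similarity scores in the paper come from a segment search that also tolerates mismatches, which this toy model does not attempt to reproduce.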