McLachlan A D, Boswell D R
J Mol Biol. 1985 Sep 5;185(1):39-49. doi: 10.1016/0022-2836(85)90181-0.
We describe new tests, of general application, for deciding whether two protein or DNA sequences are significantly homologous, in cases where the relationship is neither evidently true nor evidently false. Ralston and Bishop's comparison of the c-myc oncogene with the adenovirus E1a protein is discussed as an example. When the comparison matrix test is used to establish a homology between two sequences, the number of high scores must exceed the mean level expected for random sequences by a statistically significant margin. The mean level itself is found from the double matching probability distribution. In examples where the number of high scores is larger than expected, but the highest score is not in itself exceptional, the variance of the number of scores expected for unrelated sequences is an important factor. We have analysed these variances by several methods. A simple binomial distribution gives only a rather inaccurate and low first estimate, but we derive a more rigorous and accurate statistical treatment that takes account of the correlations between scores in different parts of the comparison matrix. The theory is exact for random DNA or protein sequences with fluctuating compositions, selected by random draws from an infinite pool. In the more realistic situation, where sequences of fixed composition are formed by random permutations of the original sets, the deviations are smaller, and have been analysed by computer simulation. We find that although the relationship proposed by Ralston and Bishop between the c-myc oncogene and the adenovirus E1a protein appears to be significant in the binomial approximation, it is not supported by the full analysis. We conclude that, in general, great care is needed to establish any weak homology on the basis of comparisons that include no truly exceptional high scores, but merely have an enhanced number of scores at the upper end of the expected distribution.
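The fixed-composition case described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the identity scoring scheme, the `span` and `threshold` values, the toy sequences, and all function names are assumptions made for illustration. The sketch counts high-scoring windows in a comparison matrix, estimates the mean and variance of that count by shuffling both sequences, and sets a simple binomial variance alongside the simulated one.

```python
import math
import random
from statistics import mean, pvariance


def count_high_scores(a, b, span, threshold):
    """Count comparison-matrix cells whose windowed score is >= threshold.

    Assumption: the score at (i, j) is the number of identical residues
    in the two windows a[i:i+span] and b[j:j+span].
    """
    count = 0
    for i in range(len(a) - span + 1):
        for j in range(len(b) - span + 1):
            score = sum(1 for k in range(span) if a[i + k] == b[j + k])
            if score >= threshold:
                count += 1
    return count


def shuffled_counts(a, b, span, threshold, trials, seed=0):
    """High-score counts for random permutations of both sequences.

    Shuffling keeps each sequence's composition fixed.
    """
    rng = random.Random(seed)
    a_list, b_list = list(a), list(b)
    counts = []
    for _ in range(trials):
        rng.shuffle(a_list)
        rng.shuffle(b_list)
        counts.append(count_high_scores(a_list, b_list, span, threshold))
    return counts


def binomial_variance(n_cells, p_high):
    """Variance of the count if every cell were an independent trial."""
    return n_cells * p_high * (1.0 - p_high)


def z_score(observed, mu, var):
    """Standardised excess of the observed count; NaN if var is zero."""
    return (observed - mu) / math.sqrt(var) if var > 0 else math.nan


def demo():
    # Hypothetical toy sequences, not data from the paper.
    seq_a = "ATGCGTACGTTAGCCGATAGCTAGGCTTAACGATCGATCG"
    seq_b = "TTGACGGCTAGCTAAGCGTACCGATTAGCATCGGATCCGA"
    span, threshold = 5, 4

    observed = count_high_scores(seq_a, seq_b, span, threshold)
    counts = shuffled_counts(seq_a, seq_b, span, threshold, trials=200)
    mu, var_sim = mean(counts), pvariance(counts)

    n_cells = (len(seq_a) - span + 1) * (len(seq_b) - span + 1)
    var_binom = binomial_variance(n_cells, mu / n_cells)

    print("observed high scores:", observed)
    print("simulated mean, variance:", mu, var_sim)
    print("binomial variance:", var_binom)
    print("z (simulated):", z_score(observed, mu, var_sim))
    print("z (binomial):", z_score(observed, mu, var_binom))


if __name__ == "__main__":
    demo()
```

Comparing the two z-scores shows how the choice of variance changes the apparent significance of the same observed count. Because overlapping windows share residues, neighbouring cells are correlated, which is the effect the binomial estimate ignores.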