Newberg Lee A, McCue Lee Ann, Lawrence Charles E
NYSDOH Wadsworth Center & Rensselaer Polytechnic Institute Department of Computer Science.
Stat Appl Genet Mol Biol. 2005;4:Article13. doi: 10.2202/1544-6115.1135. Epub 2005 Jun 1.
Approaches that use sequence weights to construct a position weight matrix of nucleotides from aligned inputs are popular, but little effort has been expended to measure their quality. We derive optimal sequence weights that minimize the sum of the variances of the estimators of base frequency parameters for sequences related by a phylogenetic tree. Using these weights, we find that approaches based upon sequence weights can perform very poorly in comparison to a theoretically optimal maximum-likelihood method when inferring the parameters of a position weight matrix. Specifically, we find that among a collection of primate sequences, even an optimal sequence-weights approach is only 51% as efficient as the maximum-likelihood approach in inferences of base frequency parameters. We also show how to employ the variance estimators to obtain a greedy ordering of species for sequencing. Applying this ordering for the weighted estimators to a primate collection yields a curve with a long plateau that is not observed with maximum-likelihood estimators. This plateau indicates that the use of weighted estimators on these data seriously limits the utility of obtaining the sequences of more than two or three additional species.
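The two ideas in the abstract, variance-minimizing sequence weights and a greedy ordering of species, can be illustrated with a minimal sketch. It is not the authors' derivation: it assumes a single estimator per sequence and a known covariance matrix between those estimators, and it uses the standard closed form for a minimum-variance linear combination with weights summing to one, w = C^{-1}1 / (1^T C^{-1} 1), whose variance is 1 / (1^T C^{-1} 1). The function names `solve`, `optimal_weights`, and `greedy_order`, and the matrix `toy_cov`, are hypothetical and chosen only for illustration; in the paper the covariances follow from a phylogenetic tree and substitution model.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def optimal_weights(cov):
    """Return (weights, variance) minimizing w^T cov w subject to sum(w) == 1."""
    z = solve(cov, [1.0] * len(cov))  # z = cov^{-1} 1
    total = sum(z)                    # 1^T cov^{-1} 1
    return [v / total for v in z], 1.0 / total


def greedy_order(cov):
    """Add one species at a time, always picking the one that lowers the
    optimally weighted variance the most. Returns (order, variance curve)."""
    chosen, remaining, curve = [], list(range(len(cov))), []

    def variance_with(k):
        idx = chosen + [k]
        sub = [[cov[i][j] for j in idx] for i in idx]
        return optimal_weights(sub)[1]

    while remaining:
        best = min(remaining, key=variance_with)
        curve.append(variance_with(best))
        chosen.append(best)
        remaining.remove(best)
    return chosen, curve


# Assumed toy covariance: species 0 and 1 are close relatives (correlation 0.9),
# species 2 and 3 are moderately related (0.5), the two pairs weakly related (0.2).
toy_cov = [
    [1.0, 0.9, 0.2, 0.2],
    [0.9, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.5],
    [0.2, 0.2, 0.5, 1.0],
]
weights, variance = optimal_weights(toy_cov)
order, curve = greedy_order(toy_cov)
```

In this toy setting, the variance curve returned by `greedy_order` cannot increase as species are added, and a closely related species contributes little once its relative is already included. That is the kind of flattening the abstract describes as a plateau, although the figures reported in the paper come from real primate data and a full phylogenetic model rather than from this simplified sketch.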