Schranz Harold W, Yap Von Bing, Easteal Simon, Knight Rob, Huttley Gavin A
John Curtin School of Medical Research, The Australian National University, Canberra, ACT, Australia.
BMC Bioinformatics. 2008 Dec 19;9:550. doi: 10.1186/1471-2105-9-550.
Continuous-time Markov models allow flexible, parametrically succinct descriptions of sequence divergence. Non-reversible forms of these models are more biologically realistic but are challenging to develop. The instantaneous rate matrices defined for these models are typically transformed into substitution probability matrices using a matrix exponentiation algorithm that employs eigendecomposition, but this algorithm has characteristic vulnerabilities that lead to significant errors when a rate matrix possesses certain 'pathological' properties. Here we tested whether pathological rate matrices exist in nature, and consider the suitability of different algorithms to their computation.
We used concatenated protein coding gene alignments from microbial genomes, primate genomes and independent intron alignments from primate genomes. The Taylor series expansion and eigendecomposition matrix exponentiation algorithms were compared to the less widely employed, but more robust, Padé with scaling and squaring algorithm for nucleotide, dinucleotide, codon and trinucleotide rate matrices. Pathological dinucleotide and trinucleotide matrices were evident in the microbial data set, affecting the eigendecomposition and Taylor algorithms respectively. Even using a conservative estimate of matrix error (occurrence of an invalid probability), both Taylor and eigendecomposition algorithms exhibited substantial error rates: ~100% of all exonic trinucleotide matrices were pathological to the Taylor algorithm while ~10% of codon positions 1 and 2 dinucleotide matrices and intronic trinucleotide matrices, and ~30% of codon matrices were pathological to eigendecomposition. The majority of Taylor algorithm errors derived from occurrence of multiple unobserved states. A small number of negative probabilities were detected from the Padé algorithm on trinucleotide matrices that were attributable to machine precision. Although the Padé algorithm does not facilitate caching of intermediate results, it was up to 3x faster than eigendecomposition on the same matrices.
Development of robust software for computing non-reversible dinucleotide, codon and higher evolutionary models requires implementation of the Padé with scaling and squaring algorithm.
连续时间马尔可夫模型能够灵活、简洁地以参数形式描述序列分歧。这些模型的不可逆形式在生物学上更符合实际,但开发起来具有挑战性。为这些模型定义的瞬时速率矩阵通常使用基于特征分解的矩阵指数算法转换为替换概率矩阵,但该算法具有特定的弱点,当速率矩阵具有某些“病态”属性时会导致显著误差。在此,我们测试了自然界中是否存在病态速率矩阵,并考虑不同算法对其计算的适用性。
我们使用了来自微生物基因组、灵长类基因组的串联蛋白质编码基因比对以及灵长类基因组的独立内含子比对。将泰勒级数展开和特征分解矩阵指数算法与应用较少但更稳健的带缩放和平方的帕德算法进行比较,用于核苷酸、二核苷酸、密码子和三核苷酸速率矩阵。病态的二核苷酸和三核苷酸矩阵在微生物数据集中很明显,分别影响特征分解和泰勒算法。即使使用保守的矩阵误差估计(出现无效概率),泰勒算法和特征分解算法都表现出相当高的错误率:所有外显子三核苷酸矩阵中约100%对泰勒算法是病态的,而密码子第1和第2位二核苷酸矩阵以及内含子三核苷酸矩阵中约10%,密码子矩阵中约30%对特征分解是病态的。泰勒算法的大多数错误源于出现多个未观察到的状态。在三核苷酸矩阵上从帕德算法检测到少量负概率,这归因于机器精度。虽然帕德算法不便于缓存中间结果,但在相同矩阵上它比特征分解快3倍。
开发用于计算不可逆二核苷酸、密码子和更高进化模型的稳健软件需要实现带缩放和平方的帕德算法。