Ezawa Kiyoshi
Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, 820-8502, Japan.
Department of Biology and Biochemistry, University of Houston, Houston, TX, 77204-5001, USA.
BMC Bioinformatics. 2016 Sep 27;17(1):397. doi: 10.1186/s12859-016-1167-6.
Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, so it is imperative to develop a method that reliably calculates the probabilities of sequence alignments arising via evolutionary processes acting on an entire sequence. Previously, we presented a perturbative formulation that facilitates the ab initio calculation of alignment probabilities under a continuous-time Markov model, which describes the stochastic evolution of an entire sequence via indels with quite general rate parameters. We also demonstrated that, under some conditions, the ab initio probability of an alignment can be factorized into the product of an overall factor and contributions from regions (or local alignments) delimited by gapless columns.
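As a rough illustration of what "regions delimited by gapless columns" means, the sketch below splits an alignment into maximal runs of gap-containing columns. It is only an assumed, minimal reading of that phrase and not the authors' implementation: the function name `local_gapped_regions`, the `-` gap character, and the example alignment are all hypothetical, and the sketch does not compute any probability.

```python
from typing import List, Tuple

GAP = "-"  # assumed gap symbol; not specified in the abstract


def local_gapped_regions(rows: List[str]) -> List[Tuple[int, int]]:
    """Return half-open column ranges [start, end) of maximal runs of
    columns that contain at least one gap. Each run is bounded on both
    sides by a gapless column or by the end of the alignment."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all alignment rows must have the same length")

    regions: List[Tuple[int, int]] = []
    start = None  # first column of the gapped run currently being read
    for col in range(width):
        gapped = any(r[col] == GAP for r in rows)
        if gapped and start is None:
            start = col                    # a gapped run begins here
        elif not gapped and start is not None:
            regions.append((start, col))   # a gapless column closes the run
            start = None
    if start is not None:                  # run extends to the last column
        regions.append((start, width))
    return regions


# Hypothetical pairwise alignment, for illustration only.
pwa = ["ACG--TTAGC-A",
       "AC-GGTT--CTA"]
print(local_gapped_regions(pwa))
```

Under the factorization described above, each such region would contribute one multiplicative factor to the alignment probability; how each factor is computed is the subject of the paper and is not reproduced here.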
Here, using our formulation, we attempt to approximately calculate the probabilities of local alignments under space-homogeneous cases. First, for every type of local pairwise alignment (PWA) and some typical types of local multiple sequence alignment (MSA), we numerically computed the total contribution from all parsimonious indel histories and that from all next-parsimonious histories, and compared them. Second, for some common types of local PWAs, we derived two integral equation systems that can be numerically solved to give practically exact solutions. We compared the total parsimonious contribution with the practically exact solution for each such local PWA. Third, we developed an algorithm that calculates the first-approximate MSA probability by multiplying the total parsimonious contributions from all local MSAs. We then compared the first-approximate probability of each local MSA with its absolute frequency in the MSAs created via a genuine sequence evolution simulator, Dawg. In all these analyses, the total parsimonious contributions approximated the multiplication factors fairly well, as long as gap sizes and branch lengths were at most moderate. Examining the accuracy of another indel probabilistic model in the light of our formulation indicated that some modifications are needed to improve that model's accuracy.
At least under moderate conditions, the approximate methods can quite accurately calculate ab initio alignment probabilities under biologically more realistic models than before. Thus, our formulation will provide other indel probabilistic models with a sound reference point.