Maiolo Massimo, Ulzega Simone, Gil Manuel, Anisimova Maria
Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), CH-8820 Wädenswil, Switzerland.
NAR Genom Bioinform. 2020 Nov 6;2(4):lqaa092. doi: 10.1093/nargab/lqaa092. eCollection 2020 Dec.
Recently we presented a frequentist dynamic programming (DP) approach for multiple sequence alignment based on the explicit model of indel evolution Poisson Indel Process (PIP). This phylogeny-aware approach produces evolutionary meaningful gap patterns and is robust to the 'over-alignment' bias. Despite linear time complexity for the computation of marginal likelihoods, the overall method's complexity is cubic in sequence length. Inspired by the popular aligner MAFFT, we propose a new technique to accelerate the evolutionary indel based alignment. Amino acid sequences are converted to sequences representing their physicochemical properties, and homologous blocks are identified by multi-scale short-time Fourier transform. Three three-dimensional DP matrices are then created under PIP, with homologous blocks defining sparse structures where most cells are excluded from the calculations. The homologous blocks are connected through intermediate 'linking blocks'. The homologous and linking blocks are aligned under PIP as independent DP sub-matrices and their tracebacks merged to yield the final alignment. The new algorithm can largely profit from parallel computing, yielding a theoretical speed-up estimated to be proportional to the cubic power of the number of sub-blocks in the DP matrices. We compare the new method to the original PIP approach and demonstrate it on real data.
最近,我们基于插入缺失进化的显式模型——泊松插入缺失过程(PIP),提出了一种用于多序列比对的频率主义动态规划(DP)方法。这种系统发育感知方法产生了具有进化意义的空位模式,并且对“过度比对”偏差具有鲁棒性。尽管计算边际似然的时间复杂度为线性,但该整体方法的复杂度在序列长度上是立方级的。受流行的比对工具MAFFT启发,我们提出了一种新技术来加速基于进化插入缺失的比对。氨基酸序列被转换为代表其物理化学性质的序列,并通过多尺度短时傅里叶变换识别同源块。然后在PIP下创建三个三维DP矩阵,同源块定义了稀疏结构,其中大多数单元格被排除在计算之外。同源块通过中间的“连接块”相连。同源块和连接块在PIP下作为独立的DP子矩阵进行比对,并将它们的回溯合并以产生最终比对。新算法可以在很大程度上受益于并行计算,理论上的加速比估计与DP矩阵中子块数量的立方成正比。我们将新方法与原始的PIP方法进行比较,并在实际数据上进行了演示。