Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy.
Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France.
Phys Rev E. 2020 Dec;102(6-1):062409. doi: 10.1103/PhysRevE.102.062409.
Sequences of nucleotides (for DNA and RNA) or amino acids (for proteins) are central objects in biology. One of the most important computational problems is sequence alignment, i.e., arranging sequences from different organisms so as to identify similar regions, detect evolutionary relationships between sequences, and predict biomolecular structure and function. This problem is typically addressed through profile models, which capture position-specific features such as conservation but assume that different positions evolve independently. In recent years, it has become well established that coevolution of different amino-acid positions is essential for maintaining three-dimensional structure and function. Modeling approaches based on inverse statistical physics can capture the coevolution signal in sequence ensembles, and they are now widely used to predict protein structure, protein-protein interactions, and mutational landscapes. Here, we present DCAlign, an efficient alignment algorithm based on an approximate message-passing strategy. It overcomes the limitations of profile models by including coevolution among positions in a general way, and it is therefore universally applicable to protein- and RNA-sequence alignment without the need for complementary structural information. The potential of DCAlign is carefully explored using well-controlled simulated data, as well as real protein and RNA sequences.
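The contrast between profile models and coevolution-aware models can be illustrated with a minimal sketch. It assumes a pairwise (Potts-like) energy of the kind commonly used in direct coupling analysis: a profile score sums independent per-position terms, while the pairwise score adds coupling terms between positions. The function names, the two-position toy sequence, and all parameter values are hypothetical and chosen for illustration only; this is not the DCAlign algorithm, which also has to handle gaps, insertions, and the message-passing optimization.

```python
from itertools import combinations


def profile_energy(seq, fields):
    """Profile-style score: E = -sum_i h_i(a_i), positions are independent."""
    return -sum(fields[i][a] for i, a in enumerate(seq))


def pairwise_energy(seq, fields, couplings):
    """Profile terms plus pairwise terms: E = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j)."""
    energy = profile_energy(seq, fields)
    for i, j in combinations(range(len(seq)), 2):
        # Missing couplings are treated as zero (assumption of this sketch).
        energy -= couplings.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return energy


# Toy parameters (hypothetical): neither position prefers any nucleotide,
# but the two positions are rewarded when they form a complementary pair.
fields = [{"A": 0.5, "U": 0.5, "G": 0.5, "C": 0.5} for _ in range(2)]
couplings = {
    (0, 1): {("A", "U"): 1.0, ("U", "A"): 1.0, ("G", "C"): 1.0, ("C", "G"): 1.0}
}

paired, unpaired = "AU", "AA"

# The profile score cannot tell the two sequences apart ...
same_profile = profile_energy(paired, fields) == profile_energy(unpaired, fields)
# ... while the pairwise score assigns a lower energy to the complementary pair.
pair_favored = pairwise_energy(paired, fields, couplings) < pairwise_energy(
    unpaired, fields, couplings
)
```

In this toy setting the profile score is identical for `AU` and `AA`, whereas the pairwise score favors the complementary pair. That difference is the kind of signal profile models miss.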