Statistical alignment: computational properties, homology testing and goodness-of-fit.

Authors

Hein J, Wiuf C, Knudsen B, Møller M B, Wibling G

Affiliation

Department of Genetics and Ecology, The Institute of Biological Science, University of Aarhus, Building 540, Ny Munkegade, Århus C, 8000, Denmark.

Publication

J Mol Biol. 2000 Sep 8;302(1):265-79. doi: 10.1006/jmbi.2000.4061.

DOI:10.1006/jmbi.2000.4061

PMID:10964574

Abstract

The model of insertions and deletions in biological sequences, first formulated by Thorne, Kishino, and Felsenstein in 1991 (the TKF91 model), provides a basis for performing alignment within a statistical framework. Here we investigate this model.

Firstly, we show how to accelerate the statistical alignment algorithms by several orders of magnitude. The main innovations are to confine likelihood calculations to a band close to the similarity-based alignment, to obtain good initial guesses of the evolutionary parameters, and to apply an efficient numerical optimisation algorithm for finding the maximum likelihood estimate. In addition, the recursions originally presented by Thorne, Kishino, and Felsenstein can be simplified. Two proteins, about 1500 amino acids long, can be analysed with this method in less than five seconds on a fast desktop computer, which makes the method practical for actual data analysis.

Secondly, we propose a new homology test based on this model, where homology means that an ancestor to a sequence pair can be found finitely far back in time. This test has statistical advantages relative to the traditional shuffle test for proteins.

Finally, we describe a goodness-of-fit test that allows testing the proposed insertion-deletion (indel) process inherent to this model, and find that real sequences (here globins) probably experience indels longer than one residue, contrary to what is assumed by the model.
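The banding idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a score-maximising global alignment rather than the TKF91 likelihood recursions, and it centres the band on the main diagonal rather than on a similarity-based alignment. The function name `banded_alignment_score` and the parameters `width`, `match`, `mismatch`, and `gap` are hypothetical choices made for illustration only.

```python
import math


def banded_alignment_score(a, b, width, match=1.0, mismatch=-1.0, gap=-2.0):
    """Best global alignment score using only cells with |i - j| <= width.

    Illustrative assumption: the band follows the main diagonal. The paper
    instead places the band around a similarity-based alignment.
    """
    n, m = len(a), len(b)
    if abs(n - m) > width:
        raise ValueError("band too narrow to reach the final cell")

    neg_inf = -math.inf
    # prev[j] is the best score for prefixes a[:i-1], b[:j];
    # cells outside the band are left at -inf and never win the max.
    prev = [neg_inf] * (m + 1)
    for j in range(0, min(m, width) + 1):
        prev[j] = j * gap

    for i in range(1, n + 1):
        cur = [neg_inf] * (m + 1)
        lo = max(0, i - width)   # leftmost column inside the band
        hi = min(m, i + width)   # rightmost column inside the band
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # a[i-1] opposite a gap
                cur[j - 1] + gap,   # b[j-1] opposite a gap
            )
        prev = cur
    return prev[m]


if __name__ == "__main__":
    print(banded_alignment_score("ACGT", "AGT", width=1))
```

Each row touches at most `2 * width + 1` cells, so the work grows with the sequence length times the band width instead of with the product of both lengths. A band that is too narrow can exclude the best path, so the banded score is never higher than the unbanded one.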
