Using CLUSTAL for multiple sequence alignments.

Author information

Higgins D G, Thompson J D, Gibson T J

Affiliation

European Molecular Biology Laboratory Outstation-European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom.

Publication information

Methods Enzymol. 1996;266:383-402. doi: 10.1016/s0076-6879(96)66024-8.

We have tested CLUSTAL W in a wide variety of situations, and it is capable of handling some very difficult protein alignment problems. If the data set consists of enough closely related sequences so that the first alignments are accurate, then CLUSTAL W will usually find an alignment that is very close to ideal. Problems can still occur if the data set includes sequences of greatly different lengths or if some sequences include long regions that are impossible to align with the rest of the data set. Trying to balance the need for long insertions and deletions in some alignments with the need to avoid them in others is still a problem. The default values for our parameters were tested empirically using test cases of sets of globular proteins where some information as to the correct alignment was available. The parameter values may not be very appropriate with nonglobular proteins. We have argued that using one weight matrix and two gap penalties is too simplistic to be of general use in the most difficult cases. We have replaced these parameters with a large number of new parameters designed primarily to help encourage gaps in loop regions. Although these new parameters are largely heuristic in nature, they perform surprisingly well and are simple to implement. The underlying speed of the progressive alignment approach is not adversely affected. The disadvantage is that the parameter space is now huge; the number of possible combinations of parameters is more than can easily be examined by hand. We justify this by asking the user to treat CLUSTAL W as a data exploration tool rather than as a definitive analysis method. It is not sensible to automatically derive multiple alignments and to trust particular algorithms as being capable of always getting the correct answer. One must examine the alignments closely, especially in conjunction with the underlying phylogenetic tree (or estimate of it) and try varying some of the parameters. 
Outliers (sequences that have no close relatives) should be aligned carefully, as should fragments of sequences. The program will automatically delay the alignment of any sequences that are less than 40% identical to any others until all other sequences are aligned, but this can be set from a menu by the user. It may be useful to build up an alignment of closely related sequences first and to then add in the more distant relatives one at a time or in batches, using the profile alignments and weighting scheme described earlier and perhaps using a variety of parameter settings. We give one example using SH2 domains. SH2 domains are widespread in eukaryotic signalling proteins where they function in the recognition of phosphotyrosine-containing peptides. In the chapter by Bork and Gibson ([11], this volume), Blast and pattern/profile searches were used to extract the set of known SH2 domains and to search for new members. (Profiles used in database searches are conceptually very similar to the profiles used in CLUSTAL W: see the chapters [11] and [13] for profile search methods.) The profile searches detected SH2 domains in the JAK family of protein tyrosine kinases, which were thought not to contain SH2 domains. Although the JAK family SH2 domains are rather divergent, they have the necessary core structural residues as well as the critical positively charged residue that binds phosphotyrosine, leaving no doubt that they are bona fide SH2 domains. The five new JAK family SH2 domains were added sequentially to the existing alignment of 65 SH2 domains using the CLUSTAL W profile alignment option. Figure 6 shows part of the resulting alignment. Despite their divergent sequences, the new SH2 domains have been aligned nearly perfectly with the old set. No insertions were placed in the original SH2 domains. In this example, the profile alignment procedure has produced better results than a one-step full alignment of all 70 SH2 domains, and in considerably less time. 
(ABSTRACT TRUNCATED)
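
The abstract says that sequences less than 40% identical to any others are delayed until the rest have been aligned. The following is a minimal Python sketch of that selection rule only, not CLUSTAL W's actual implementation. It assumes the pairwise percent identities have already been computed; the function name `split_by_divergence`, the `identity` dictionary layout, and the example values are all hypothetical and chosen for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple


def split_by_divergence(
    names: List[str],
    identity: Dict[FrozenSet[str], float],
    threshold: float = 40.0,
) -> Tuple[List[str], List[str]]:
    """Split sequences into those aligned first and those delayed.

    A sequence is delayed when its best percent identity to every other
    sequence is below `threshold` (40% is the default quoted in the abstract).
    `identity` maps an unordered pair of names to a percent identity
    (hypothetical layout, assumed to be precomputed elsewhere).
    """
    core: List[str] = []
    delayed: List[str] = []
    for name in names:
        # Highest identity between this sequence and any other sequence.
        # With no other sequences, the best identity defaults to 0.0.
        best = max(
            (identity[frozenset((name, other))] for other in names if other != name),
            default=0.0,
        )
        if best < threshold:
            delayed.append(name)
        else:
            core.append(name)
    return core, delayed


# Illustrative values only: D is an outlier with no close relative.
example_identity = {
    frozenset(("A", "B")): 85.0,
    frozenset(("A", "C")): 62.0,
    frozenset(("B", "C")): 70.0,
    frozenset(("A", "D")): 22.0,
    frozenset(("B", "D")): 31.0,
    frozenset(("C", "D")): 28.0,
}
core, delayed = split_by_divergence(["A", "B", "C", "D"], example_identity)
print("align first:", core)
print("delay:", delayed)
```

In this sketch the delayed list corresponds to the outliers that the abstract recommends aligning carefully, for example by adding them one at a time to an existing alignment of the closely related sequences.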