Heger A, Holm L
Structural Genomics Group, EMBL-EBI, Cambridge CB10 1SD, UK.
Bioinformatics. 2001 Mar;17(3):272-9. doi: 10.1093/bioinformatics/17.3.272.
Evolutionary classification leads to an economical description of protein sequence data because attributes of function and structure are inherited in protein families. This paper presents Picasso, a procedure for deriving a minimal set of protein family profiles that cover all known protein sequences.
Picasso starts from highly overlapping sequence neighbourhoods revealed by all-on-all pairwise Blast alignment. Overlaps are reduced by merging sequences or parts of sequences into multiple alignments. For maximum unification, the multiple alignments must reach into the twilight zone of sequence similarity. Sensitive and selective profile-profile comparison allows unification down to about 15% pairwise sequence identity. Families unified through a short conserved sequence motif are associated with multiple full-length alignments describing different subfamilies. Domains that are mobile modules are identified based on their association with different sets of neighbours. The result is 10000 unified domain families (excluding singletons) representing functionally related proteins and recovering classical prolific domain types in high numbers. The classification is useful, for example, in developing strategies for efficient database searching and for selecting targets to complete the map of all 3-D structures.
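The step of reducing overlapping neighbourhoods can be illustrated with a toy sketch. It is not the Picasso procedure itself, which merges sequences or parts of sequences into multiple alignments and compares profiles. It only shows the simpler idea of collapsing overlapping pairwise hits into families by single linkage. The function name `merge_neighbourhoods` and the input format (pairs of sequence identifiers standing in for significant all-on-all hits) are assumptions made for this illustration.

```python
from collections import defaultdict


def merge_neighbourhoods(hits):
    """Group sequence ids into families by single linkage over pairwise hits.

    `hits` is an iterable of (query_id, subject_id) pairs; this is a
    hypothetical stand-in for significant all-on-all pairwise alignments.
    Returns a list of sorted member lists, ordered by their first member.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Every hit links the two neighbourhoods its sequences belong to.
    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each merged neighbourhood.
    families = defaultdict(set)
    for seq in list(parent):
        families[find(seq)].add(seq)
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: members[0])
```

For example, `merge_neighbourhoods([("A", "B"), ("B", "C"), ("D", "E")])` returns `[['A', 'B', 'C'], ['D', 'E']]`. Linking whole sequences in this way would chain unrelated families together through shared mobile domains, which is one reason a domain-level treatment such as the one described above works on parts of sequences rather than whole entries.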