Elofsson A, Sonnhammer E L
Department of Biochemistry, Stockholm University, 106 91 Stockholm, Sweden.
Bioinformatics. 1999 Jun;15(6):480-500. doi: 10.1093/bioinformatics/15.6.480.
Protein families can be defined based on structure or sequence similarity. We wanted to compare two protein family databases, one based on structural and one on sequence similarity, to investigate how much they overlap and how similarly corresponding families are defined, and to create a list of large protein families of unknown structure as a resource for structural genomics. We also wanted to increase the sensitivity of fold assignment by exploiting protein family HMMs.
We compared Pfam, a protein family database based on sequence similarity, to Scop, which is based on structural similarity. We found that 70% of the Scop families exist in Pfam, while 57% of the Pfam families exist in Scop. Most families that occur in both databases correspond well to each other, but some are defined differently. Such cases highlight situations in which structure and sequence approaches differ significantly. The comparison enabled us to compile a list of the largest families that do not occur in Scop; these are suitable targets for structure prediction and determination, and may help guide projects in structural genomics. Notably, 13 of the 20 largest protein families without a known structure are likely transmembrane proteins. We also exploited Pfam to increase the sensitivity of detecting homologs of proteins with known structure, by comparing query sequences to Pfam HMMs that correspond to Scop families. For SWISSPROT+TREMBL, this increased fold assignment from 31% to 42% compared to using FASTA only. This method assigned a structure to 22% of the proteins in Saccharomyces cerevisiae, 24% in Escherichia coli, and 16% in Methanococcus jannaschii.
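The two overlap figures (70% of Scop families in Pfam, 57% of Pfam families in Scop) are not symmetric, because the databases differ in size and several families in one may map to a single family in the other. The following is a minimal sketch of how such a two-way coverage could be computed from a list of matched family pairs. It is not the authors' procedure: the function names, the identifiers, and the toy data are all hypothetical, and how a pair is judged to "correspond" is assumed to be decided elsewhere.

```python
def coverage(families, matched):
    """Return the fraction of `families` that have at least one match."""
    if not families:
        return 0.0
    return len(families & matched) / len(families)


def asymmetric_overlap(scop_families, pfam_families, matches):
    """Compute coverage in both directions.

    `matches` is an iterable of (scop_id, pfam_id) pairs that some
    external criterion (assumed here, not specified by the abstract)
    has judged to correspond.
    """
    matches = list(matches)  # allow generators; we iterate twice
    scop_hit = {s for s, _ in matches}
    pfam_hit = {p for _, p in matches}
    return (coverage(scop_families, scop_hit),
            coverage(pfam_families, pfam_hit))


# Toy, made-up identifiers purely for illustration.
scop = {"S1", "S2", "S3", "S4"}
pfam = {"P1", "P2", "P3", "P4", "P5"}
# Two Scop families map to the same Pfam family (many-to-one).
matches = [("S1", "P1"), ("S2", "P1"), ("S3", "P2")]

scop_cov, pfam_cov = asymmetric_overlap(scop, pfam, matches)
print(f"Scop families found in Pfam: {scop_cov:.0%}")
print(f"Pfam families found in Scop: {pfam_cov:.0%}")
```

In this toy case three of four Scop families are covered but only two of five Pfam families, which shows how a many-to-one correspondence yields different percentages in each direction.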