Sammut Stephen John, Finn Robert D, Bateman Alex
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton Hall, Hinxton, Cambridgeshire, CB10 1SA, UK.
Brief Bioinform. 2008 May;9(3):210-9. doi: 10.1093/bib/bbn010. Epub 2008 Mar 15.
Classifications of proteins into groups of related sequences are in some respects like a periodic table for biology, allowing us to understand the underlying molecular biology of any organism. Pfam is a large collection of protein domains and families. Its scientific goal is to provide a complete and accurate classification of protein families and domains. The next release of the database will contain over 10,000 entries, which leads us to reflect on how far we are from completing this work. Currently, Pfam matches 72% of known protein sequences, but for proteins with known structure Pfam matches 95%, which we believe represents the likely upper bound. Based on our analysis, a further 28,000 families would be required to achieve this level of coverage for the current sequence database. We also show that as more sequences are added to the sequence databases, the fraction of sequences that Pfam matches falls, suggesting that continued addition of new families is essential to maintain its relevance.
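The figures quoted in the abstract imply a rough measure of diminishing returns, which the short Python sketch below works through. It is illustrative arithmetic only, not the authors' analysis. It assumes that "over 10,000 entries" can be rounded to 10,000 and paired with the current 72% coverage figure, and that coverage can be averaged per family. The constant and function names are hypothetical.

```python
# Figures quoted in the abstract. Pairing the upcoming release's entry count
# with today's coverage figure is an assumption made for illustration.
CURRENT_FAMILIES = 10_000   # "over 10,000 entries", rounded down
CURRENT_COVERAGE = 72.0     # % of known protein sequences matched
TARGET_COVERAGE = 95.0      # % matched for proteins of known structure
EXTRA_FAMILIES = 28_000     # further families the authors estimate are needed


def families_per_point(families: int, coverage_points: float) -> float:
    """Average number of families per percentage point of coverage."""
    return families / coverage_points


# Percentage points still to be gained to reach the proposed upper bound.
gap = TARGET_COVERAGE - CURRENT_COVERAGE

# Average cost of a percentage point so far versus over the remaining gap.
so_far = families_per_point(CURRENT_FAMILIES, CURRENT_COVERAGE)
remaining = families_per_point(EXTRA_FAMILIES, gap)

print(f"coverage gap: {gap:.0f} percentage points")
print(f"families per point so far: {so_far:.0f}")
print(f"families per point remaining: {remaining:.0f}")
print(f"ratio: {remaining / so_far:.1f}x")
```

Under these assumptions, each remaining percentage point of coverage would cost on average roughly nine times as many new families as each point gained so far, in line with the abstract's conclusion that new families must keep being added.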