Boberg J, Salakoski T, Vihinen M
Protein Eng. 1995 May;8(5):501-3. doi: 10.1093/protein/8.5.501.
The selection of unbiased representatives from a large database is complicated by the requirement that the chosen entries be not only genuinely different from each other but also typical of the family of related entries. A method satisfying this twofold objective was developed by equipping complete linkage clustering with a novel noise elimination procedure to deal with overlapping cluster structure. A total of 200 nuclear families of truly related Brookhaven Protein Data Bank structures were generated, from which any entry can be chosen to represent its family.
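As a rough illustration of the complete linkage clustering named in the abstract, the following minimal sketch groups items by repeatedly merging the two closest clusters, where the distance between clusters is taken between their most dissimilar members. The function name `complete_linkage`, the `distance` callback, and the `threshold` cutoff are assumptions made for this example. The abstract does not describe the noise elimination procedure or the dissimilarity measure used for the protein structures, so neither is reproduced here.

```python
from itertools import combinations


def complete_linkage(distance, n, threshold):
    """Agglomerative complete-linkage clustering over items 0..n-1.

    distance(i, j) is a hypothetical symmetric dissimilarity supplied by the
    caller. Merging stops once the closest pair of clusters is farther apart
    than threshold (an assumed cutoff, not a value from the paper).
    """
    clusters = [[i] for i in range(n)]

    def cluster_distance(a, b):
        # Complete linkage: use the two most dissimilar members.
        return max(distance(i, j) for i in a for j in b)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        (x, y), best = min(
            (((x, y), cluster_distance(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best > threshold:
            break
        # x < y always holds, so deleting y does not shift index x.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy data: one-dimensional points standing in for database entries.
values = [0, 1, 2, 10, 11, 50]
families = complete_linkage(lambda i, j: abs(values[i] - values[j]),
                            len(values), threshold=5)
print(families)
```

With a threshold cutoff, complete linkage guarantees that every pair of members in a cluster lies within the threshold of each other. That property fits the abstract's remark that any entry can stand in for its family, although the paper's actual criteria may differ from this sketch.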