Rigoutsos I, Floratos A, Ouzounis C, Gao Y, Parida L
Computational Biology Center, Thomas J. Watson Research Center, Yorktown Heights, New York 10598, USA.
Proteins. 1999 Nov 1;37(2):264-77. doi: 10.1002/(sici)1097-0134(19991101)37:2<264::aid-prot11>3.0.co;2-c.
Using Teiresias, a pattern discovery method that identifies all motifs present in any given set of protein sequences without requiring alignment or explicit enumeration of the solution space, we have explored the GenPept sequence database and built a dictionary of all sequence patterns with two or more instances. The entries of this dictionary, henceforth named seqlets, cover 98.12% of all amino acid positions in the input database and in essence provide a comprehensive finite set of descriptors for protein sequence space. As such, seqlets can be used to describe almost every naturally occurring protein. In fact, seqlets can be thought of as building blocks of protein molecules that are a necessary (but not sufficient) condition for functional or family membership. Thus, seqlets can either define conserved family signatures or cut across molecular families and reveal previously undetected sequence signals deriving from functional convergence. Moreover, we show that seqlets can also capture structurally conserved motifs. The availability of a dictionary of seqlets derived in such an unsupervised, hierarchical manner creates new opportunities for addressing problems that range from reliable classification and the correlation of sequence fragments with functional categories to faster and more sensitive engines for homology searches, evolutionary studies, and protein structure prediction.
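The two quantities the abstract reports, a dictionary of patterns with two or more instances and the fraction of amino acid positions those patterns cover, can be illustrated with a deliberately simplified sketch. It is not the Teiresias algorithm: it assumes fixed-length, exact-match substrings with no wildcards, it counts repeats within a single sequence as separate instances, and the function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def find_repeated_kmers(sequences, k, min_support=2):
    """Map each length-k substring with at least min_support instances
    to its list of (sequence_index, start_position) occurrences.

    Simplifying assumption: exact fixed-length matches only, whereas
    Teiresias discovers variable-length patterns.
    """
    occurrences = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            occurrences[seq[start:start + k]].append((seq_index, start))
    return {kmer: hits for kmer, hits in occurrences.items()
            if len(hits) >= min_support}


def coverage(sequences, patterns, k):
    """Fraction of residue positions lying inside at least one pattern instance."""
    covered = set()
    for hits in patterns.values():
        for seq_index, start in hits:
            for offset in range(k):
                covered.add((seq_index, start + offset))
    total = sum(len(seq) for seq in sequences)
    return len(covered) / total if total else 0.0


# Hypothetical toy input, not taken from GenPept.
toy_sequences = ["MKVLAAGIT", "GGMKVLTTP", "AAGITQQ"]
patterns = find_repeated_kmers(toy_sequences, k=4)

print(sorted(patterns))                                 # ['AAGI', 'AGIT', 'MKVL']
print(f"{coverage(toy_sequences, patterns, k=4):.2%}")  # 72.00%
```

Here 18 of the 25 residues fall inside a repeated 4-mer, giving 72% coverage of the toy set; the 98.12% figure in the abstract is the analogous measurement over the full database with the real seqlet dictionary.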