KAPOW, Departamento de Computación, Facultad de Ciencias Exactas y Naturales, UBA-CONICET-ICC, Buenos Aires, Argentina.
Protein Physiology Lab, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, UBA-CONICET-IQUIBICEN, Buenos Aires, Argentina.
J Phys Chem B. 2018 Dec 13;122(49):11295-11301. doi: 10.1021/acs.jpcb.8b07206. Epub 2018 Oct 8.
All known terrestrial proteins are coded as continuous strings built from ≈20 amino acids. The patterns formed by the repetitions of elements in groups of finite sequences describe the natural architectures of protein families. We present a method to search for patterns and groupings of patterns in protein sequences using a mathematically precise definition of "repetition", an efficient algorithmic implementation, and a robust scoring system with no adjustable parameters. We show that the sequence patterns can be well separated into disjoint classes according to their recurrence in nested structures. The statistics of pattern occurrences indicate that short repetitions are sufficient to distinguish natural families from randomized groups of sequences by more than 10 standard deviations, while contiguous sequence patterns shorter than 5 residues occur effectively at random. A small subset of patterns is sufficient to support a robust definition of "familiarity" between arbitrary sets of sequences.
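The comparison between a natural family and randomized sequences can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give their definition of "repetition" or their scoring system, so the sketch substitutes fixed-length contiguous substrings (k-mers) shared by at least two sequences, and a composition-preserving shuffle as the random baseline. The function names (`shared_kmers`, `shuffled`, `repeat_zscore`), the parameters `k`, `n_shuffles`, and `seed`, and the toy sequences are all hypothetical.

```python
import math
import random
from collections import Counter
from statistics import mean, pstdev


def shared_kmers(seqs, k):
    """Count distinct length-k substrings present in at least two sequences."""
    seen = Counter()
    for s in seqs:
        # A set ensures each sequence contributes a given k-mer at most once.
        seen.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return sum(1 for n in seen.values() if n >= 2)


def shuffled(seq, rng):
    """Return a random permutation of seq (residue composition is preserved)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def repeat_zscore(seqs, k, n_shuffles=200, seed=0):
    """Z-score of the shared k-mer count against shuffled versions of seqs.

    Assumption: the null model shuffles each sequence independently; the
    paper's randomization procedure may differ.
    """
    rng = random.Random(seed)
    observed = shared_kmers(seqs, k)
    null = [
        shared_kmers([shuffled(s, rng) for s in seqs], k)
        for _ in range(n_shuffles)
    ]
    mu, sigma = mean(null), pstdev(null)
    if sigma == 0:
        # Degenerate null distribution: every shuffle gave the same count.
        return 0.0 if observed == mu else math.copysign(math.inf, observed - mu)
    return (observed - mu) / sigma


# Toy "family": three made-up sequences sharing the invented motif GDSLAYGLR.
family = ["MKTWGDSLAYGLRQEN", "PHVGDSLAYGLRCTFI", "GDSLAYGLRNNKWDEQ"]
print(shared_kmers(family, 5), repeat_zscore(family, 5))
```

A large positive score means the set shares more k-mers than its shuffled counterparts do. Fixing `k` is a simplification made for brevity; the abstract describes a scoring system with no adjustable parameters, which this sketch does not reproduce.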