Yu L, White J V, Smith T F
BioMolecular Engineering Research Center, College of Engineering, Boston University, Massachusetts 02215, USA.
Protein Sci. 1998 Dec;7(12):2499-510. doi: 10.1002/pro.5560071203.
A new method is presented for identifying distantly related homologous proteins that are unrecognizable by conventional sequence comparison methods. The method combines information about functionally conserved sequence patterns with information about structure context. This information is encoded in stochastic discrete state-space models (DSMs) that comprise a new family of hidden Markov models. The new models are called sequence-pattern-embedded DSMs (pDSMs). This method can identify distantly related protein family members with a high sensitivity and specificity. The method is illustrated with trypsin-like serine proteases and globins. The strategy for building pDSMs is presented. The method has been validated using carefully constructed positive and negative control sets. In addition to the ability to recognize remote homologs, pDSM sequence analysis predicts secondary structures with higher sensitivity, specificity, and Q3 accuracy than DSM analysis, which omits information about conserved sequence patterns. The identification of trypsin-like serine proteases in new genomes is discussed.