Guo Yuzhi, Wu Jiaxiang, Ma Hehuan, Wang Sheng, Huang Junzhou
Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA.
Tencent AI Lab, Shenzhen, China.
J Comput Biol. 2021 Apr;28(4):346-361. doi: 10.1089/cmb.2020.0416. Epub 2021 Feb 22.
Accurate prediction of protein structure properties, such as secondary structure and solvent accessibility, is essential for analyzing the structure and function of a protein. Position-specific scoring matrix (PSSM) features are widely used in structure property prediction. However, some proteins have low-quality PSSM features because too few homologous sequences are available, which limits prediction accuracy. To address this limitation, we propose an enhancement scheme for PSSM features. We introduce the "Bagging MSA" (multiple sequence alignment) method to calculate the PSSM features used to train our model, adopt a convolutional network to capture local context features and a bidirectional long short-term memory network to capture long-term dependencies, and integrate them under an unsupervised framework. Structure property prediction models are then built upon the enhanced PSSM features for more accurate predictions. Moreover, we develop two frameworks to evaluate the effectiveness of the enhanced PSSM features, which also bring the proposed method into real-world scenarios. Empirical evaluation on the CB513, CASP11, and CASP12 data sets indicates that our unsupervised enhancement scheme generates more informative PSSM features for structure property prediction.
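As an illustration of the idea behind "Bagging MSA", the following minimal Python sketch subsamples an alignment and computes a simplified log-odds profile from the subsample. It is not the authors' implementation: the abstract does not specify the sampling procedure, so sampling homologs with replacement is an assumption, and the function names `pssm_from_msa` and `bagging_msa`, the uniform background frequency, the pseudocount, and the toy alignment are all hypothetical. Real PSSMs are typically produced by tools such as PSI-BLAST, which apply sequence weighting and empirical background frequencies.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_from_msa(msa, pseudocount=1.0):
    """Return one 20-element log-odds row per alignment column.

    Simplifications (assumptions, not the paper's method): uniform
    background of 1/20, no sequence weighting, gaps and unknown
    residues are ignored.
    """
    if not msa:
        raise ValueError("MSA must contain at least one sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        # Start every residue at the pseudocount to avoid log(0).
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in msa:
            residue = seq[col]
            if residue in counts:
                counts[residue] += 1.0
        total = sum(counts.values())
        profile.append(
            [math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS]
        )
    return profile


def bagging_msa(msa, sample_size, rng):
    """Keep the query (first row) and draw the remaining rows from the
    homologs with replacement (assumed bootstrap-style sampling)."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    query, homologs = msa[0], msa[1:]
    if not homologs:
        return [query]
    sampled = [rng.choice(homologs) for _ in range(sample_size - 1)]
    return [query] + sampled


# Toy alignment (hypothetical data): the first row is the query sequence.
msa = ["MKVL", "MKIL", "MRVL", "M-VL", "MKVI"]
rng = random.Random(0)

full_pssm = pssm_from_msa(msa)          # profile from the full alignment
bagged = bagging_msa(msa, 2, rng)       # shallow, subsampled alignment
low_quality_pssm = pssm_from_msa(bagged)
```

Under these assumptions, pairs such as `(low_quality_pssm, full_pssm)` could serve as input and target when training an enhancement model without structure labels, which is one way to read the unsupervised framing in the abstract.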