Institute of Systems Biology, Shanghai University, Shanghai 200444, China.
Biomed Res Int. 2013;2013:414327. doi: 10.1155/2013/414327. Epub 2013 Apr 22.
As more disordered proteins and their important functions are discovered, effective computational methods for predicting disordered regions in proteins are increasingly needed. In this study, we developed a new method for predicting disordered regions in proteins, based on Random Forest (RF), Maximum Relevancy Minimum Redundancy (mRMR), and Incremental Feature Selection (IFS). The mRMR criterion was used to rank all candidate features by importance. The top 128 features in the ranked list were then selected to build the optimal model: 92 Position Specific Scoring Matrix (PSSM) conservation score features and 36 secondary structure features. This model achieved a Matthews correlation coefficient (MCC) of 0.3895 on the training set under 10-fold cross-validation. We then applied a scanning and modification strategy to the prediction results for each query sequence to further improve performance. Compared with three other popular predictors (DISOPRED, DISOclust, and OnD-CRF), accuracy (ACC) and MCC increased by 4% and almost 0.2%, respectively. The selected features may shed light on the mechanism by which disordered structures form and provide guidance for experimental validation.
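The evaluation metric and the feature-selection loop named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `mcc` and `incremental_feature_selection` and the `evaluate` callback are assumptions introduced here; `evaluate` stands in for whatever scoring the study used (for example, the 10-fold cross-validated MCC of a Random Forest trained on the given feature subset). The sketch assumes the features have already been ranked, for instance by mRMR, and that IFS keeps the ranked prefix with the best score.

```python
import math
from typing import Callable, List, Sequence, Tuple


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def incremental_feature_selection(
    ranked: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Score each prefix of a ranked feature list and keep the best one.

    `ranked` is assumed to be ordered from most to least important.
    `evaluate` is a hypothetical callback returning a score (e.g. MCC)
    for a model built on the given feature subset.
    """
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked) + 1):
        score = evaluate(ranked[:k])
        # Strict comparison keeps the smallest subset among tied scores.
        if score > best_score:
            best_k, best_score = k, score
    return list(ranked[:best_k]), best_score


if __name__ == "__main__":
    # Toy scores keyed by subset size; a real run would train and validate a model.
    toy_scores = {1: 0.2, 2: 0.5, 3: 0.4, 4: 0.5}
    features, score = incremental_feature_selection(
        ["f1", "f2", "f3", "f4"], lambda subset: toy_scores[len(subset)]
    )
    print(features, score)
```

Passing the scorer as a callback keeps the selection loop independent of any particular classifier or cross-validation scheme, so the same loop works whether the score comes from a Random Forest or another model.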