Dharmsinh Desai University, Department of Computer Engineering, Faculty of Technology, D D University, Nadiad, 387001, India.
Research and Development Center, Faculty of Technology, Dharmsinh Desai University, Nadiad, 387001, India.
Comput Biol Chem. 2020 Feb;84:107164. doi: 10.1016/j.compbiolchem.2019.107164. Epub 2019 Nov 15.
At present, the rate at which tertiary structures are discovered lags far behind the discovery of primary structures. Predicting protein structural class with Machine Learning techniques can help reduce this gap. The Structural Classification of Proteins - extended (SCOPe 2.07) is the latest and largest dataset currently available. Protein sequences with less than 40% identity to each other are used to predict the α, β, α/β and α + β SCOPe classes. Sensitive features are extracted from primary and secondary structure representations of proteins. Secondary structure features are extracted experimentally with respect to frequency, pitch and spatial arrangement. Primary structure based features contain species information for a protein sequence. The species parameters are further validated against the UniRef100 dataset using TaxId. Protein tertiary structure is known to be a manifestation of function, and functional differences are observed between species. Hence, species is expected to correlate strongly with structural class, and the current work finds such a correlation. Including it improves prediction accuracy by 7%-10%. A Random Forest classifier is trained on a subset of SCOPe 2.07 using a 65-dimensional feature vector. Testing on the rest of the set gives a consistent accuracy of better than 95%. The accuracy achieved on the benchmark datasets ASTRAL 1.73, 25PDB and FC699 is better than 86%, 91% and 97% respectively, which is, to our knowledge, the best reported.
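As an illustration of what features based on the frequency and arrangement of secondary structure elements might look like, the following is a minimal sketch only. The abstract does not list the 65 features actually used, so everything here is an assumption: the function name `ss_features`, the three-state alphabet (H for helix, E for strand, C for coil), and the particular quantities computed are hypothetical and are not the authors' method. The count of helix/strand switches is included because alternating helices and strands are typical of α/β proteins, whereas α + β proteins tend to keep them segregated.

```python
from itertools import groupby


def ss_features(ss: str) -> dict:
    """Toy features from a 3-state secondary-structure string.

    Assumed alphabet (hypothetical, not from the paper):
    H = helix, E = strand, C = coil.
    """
    if not ss:
        raise ValueError("empty secondary-structure string")
    n = len(ss)

    # Collapse runs of equal states: "HHHCCEE" -> [("H", 3), ("C", 2), ("E", 2)]
    segments = [(state, len(list(run))) for state, run in groupby(ss)]
    helices = [length for state, length in segments if state == "H"]
    strands = [length for state, length in segments if state == "E"]

    # Order of helix/strand segments with coil dropped, e.g. "HEHE"
    order = "".join(state for state, _ in segments if state in "HE")
    # Number of times the chain switches between helix and strand segments
    switches = sum(1 for a, b in zip(order, order[1:]) if a != b)

    return {
        "frac_H": ss.count("H") / n,
        "frac_E": ss.count("E") / n,
        "n_helix_segments": len(helices),
        "n_strand_segments": len(strands),
        "mean_helix_len": sum(helices) / len(helices) if helices else 0.0,
        "mean_strand_len": sum(strands) / len(strands) if strands else 0.0,
        "helix_strand_switches": switches,
    }


if __name__ == "__main__":
    # Made-up 20-residue string with alternating helices and strands
    print(ss_features("CHHHHCCEEECHHHHCEEEC"))
```

A dictionary like this could be flattened into one row of a feature matrix and passed to any classifier; the secondary-structure string itself would have to come from an assignment or prediction tool, which this sketch does not cover.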