Yadav Amisha, Sahu Roopshikha, Nath Abhigyan
Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur 492001, India.
Comput Biol Chem. 2020 May 5;87:107274. doi: 10.1016/j.compbiolchem.2020.107274.
Growth hormone binding proteins (GHBPs) are soluble proteins that play an important role in modulating growth hormone signaling pathways. GHBPs are selective and bind non-covalently to growth hormones, but their functions are still not fully understood. Identifying and characterizing GHBPs is a preliminary step toward understanding their roles in various cellular processes. Because wet-lab experimental methods are costly and labor-intensive, computational methods can help narrow the search space of putative GHBPs. The performance of a machine learning algorithm depends largely on the quality of the features it is given. Informative and non-redundant features generally improve performance, and feature selection algorithms are commonly used to obtain them. In the present work, a novel representation transfer learning approach is presented for the prediction of GHBPs. Deep autoencoder-based features were extracted, and an SMO-PolyK classifier was then trained on them. The prediction model was evaluated by both leave-one-out cross-validation (LOOCV) and a hold-out independent test set. On LOOCV, the model achieved 89.8% accuracy, 89.4% sensitivity, and 90.2% specificity; on the hold-out test set, it attained 93.5% accuracy, 90.2% sensitivity, and 96.8% specificity. Prediction accuracy was further compared across four feature sets: the full set of sequence-based features, the top-performing sequence features chosen by a feature selection algorithm, deep autoencoder-based features, and generalized low rank model-based features. Principal component analysis of the representative features, together with t-SNE visualization, demonstrated the effectiveness of deep features in predicting GHBPs. The present method is robust and accurate and may complement wet-lab methods for identifying novel GHBPs.
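The evaluation protocol named in the abstract (leave-one-out cross-validation scored by accuracy, sensitivity, and specificity) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: a simple nearest-centroid rule stands in for the SMO-PolyK classifier, the feature vectors are assumed to be already extracted, and all function names (`confusion_metrics`, `nearest_centroid_predict`, `loocv_predictions`) and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Predictor = Callable[[List[Vector], List[int], Vector], int]


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for binary labels (1 = GHBP, 0 = non-GHBP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def nearest_centroid_predict(train_X: List[Vector], train_y: List[int], x: Vector) -> int:
    """Stand-in classifier (assumption): assign x to the class with the closest mean vector."""
    best_label, best_dist = -1, float("inf")
    for label in sorted(set(train_y)):
        members = [v for v, lab in zip(train_X, train_y) if lab == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_predictions(X: Sequence[Vector], y: Sequence[int], predict: Predictor) -> List[int]:
    """Hold out each sample once, fit on the rest, and predict the held-out sample."""
    preds = []
    for i in range(len(X)):
        train_X = list(X[:i]) + list(X[i + 1:])
        train_y = list(y[:i]) + list(y[i + 1:])
        preds.append(predict(train_X, train_y, X[i]))
    return preds


if __name__ == "__main__":
    # Made-up 2-D "features" for illustration only; not data from the study.
    X_toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
    y_toy = [0, 0, 0, 1, 1, 1]
    preds = loocv_predictions(X_toy, y_toy, nearest_centroid_predict)
    print(confusion_metrics(y_toy, preds))
```

Passing the classifier as a function keeps the cross-validation loop independent of the model, so a polynomial-kernel SVM from any library could be substituted for the stand-in without changing the evaluation code.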