Chen Jianwen, Zheng Shuangjia, Zhao Huiying, Yang Yuedong
School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China.
Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, China.
J Cheminform. 2021 Feb 8;13(1):7. doi: 10.1186/s13321-021-00488-1.
Protein solubility matters when engineering new soluble proteins, which can reduce the cost of biocatalysts and therapeutic agents. A computational model that accurately predicts solubility from the amino acid sequence is therefore highly desirable. Many methods have been developed, but most rely on one-dimensional embeddings of amino acids, which are limited in capturing spatial structural information. In this study, we developed GraphSol, a new structure-aware method that predicts protein solubility with an attentive graph convolutional network (GCN), in which the protein topology attribute graph is constructed from contact maps predicted from the sequence alone. GraphSol substantially outperformed other sequence-based methods. The model was shown to be stable, with a consistent R² of 0.48 in both cross-validation and the independent test on the eSOL dataset. To the best of our knowledge, this is the first study to use a GCN for sequence-based protein solubility prediction. More importantly, the architecture could be easily extended to other protein prediction tasks that take a raw protein sequence as input.
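The pipeline the abstract describes (predicted contact map, then residue graph, then graph convolution, then attention over residues, then a solubility score) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the GraphSol implementation. The 0.5 contact threshold, the single convolution layer, the dot-product attention, the sigmoid output, and all function names and weights below are hypothetical choices made for the example.

```python
import math
import random

def contact_graph(contact_probs, threshold=0.5):
    """Binarize a predicted contact map into an adjacency matrix with self-loops.
    The 0.5 threshold is an assumption for this sketch."""
    n = len(contact_probs)
    return [[1.0 if i == j or contact_probs[i][j] >= threshold else 0.0
             for j in range(n)] for i in range(n)]

def normalize(adj):
    """Symmetric normalization D^-1/2 A D^-1/2 (self-loops keep every degree >= 1)."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj_norm, feats, weight):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    hidden = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

def attention_pool(node_feats, att_vector):
    """Softmax attention over residues, then a weighted sum of node features.
    Returns a fixed-size protein vector and the per-residue attention weights."""
    scores = [sum(h * a for h, a in zip(row, att_vector)) for row in node_feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_feats[0])
    pooled = [sum(w * row[k] for w, row in zip(weights, node_feats))
              for k in range(dim)]
    return pooled, weights

def predict_solubility(contact_probs, feats, w_gcn, att_vector, w_out, b_out=0.0):
    """Contact map + per-residue features -> score in (0, 1) via a sigmoid (assumed)."""
    adj_norm = normalize(contact_graph(contact_probs))
    hidden = gcn_layer(adj_norm, feats, w_gcn)
    pooled, weights = attention_pool(hidden, att_vector)
    logit = sum(p * w for p, w in zip(pooled, w_out)) + b_out
    return 1.0 / (1.0 + math.exp(-logit)), weights

# Toy input: a 5-residue chain with one long-range contact and random features.
random.seed(0)
n_res, d_in, d_hid = 5, 4, 3
contact_probs = [[0.0] * n_res for _ in range(n_res)]
for i in range(n_res - 1):
    contact_probs[i][i + 1] = contact_probs[i + 1][i] = 0.9
contact_probs[0][4] = contact_probs[4][0] = 0.7
feats = [[random.random() for _ in range(d_in)] for _ in range(n_res)]
w_gcn = [[random.uniform(-1, 1) for _ in range(d_hid)] for _ in range(d_in)]
att_vector = [random.uniform(-1, 1) for _ in range(d_hid)]
w_out = [random.uniform(-1, 1) for _ in range(d_hid)]

score, attention = predict_solubility(contact_probs, feats, w_gcn, att_vector, w_out)
print(f"score={score:.3f}", [round(a, 3) for a in attention])
```

The weights here are random, so the printed score carries no meaning. The point is the data flow: because the graph comes from a predicted contact map, residues that are far apart in the sequence but close in space exchange information in the convolution step, which a purely one-dimensional embedding cannot do. The attention pooling turns a variable-length protein into a fixed-size vector.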