Zhu Mingming, Song Yidong, Yuan Qianmu, Yang Yuedong
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China.
High Performance Computing Department, National Supercomputing Center in Shenzhen, Shenzhen, Guangdong, 518000, China.
Commun Biol. 2024 Dec 29;7(1):1709. doi: 10.1038/s42003-024-07436-3.
Proteins derived from microorganisms that survive in the harshest environments on Earth retain stable activity under extreme conditions, making them a rich resource for industrial applications and enzyme engineering. Because experimental determination is time-consuming, computational models are needed to predict the optimal conditions of proteins quickly and accurately. Previous studies were limited by scarce data and by their neglect of protein structure. To address these problems, we constructed an up-to-date dataset of 175,905 non-redundant proteins and propose GeoPoc, a new model based on geometric graph learning that predicts the optimal temperature, pH, and salt concentration of proteins. GeoPoc leverages protein structures together with sequence embeddings extracted from a pre-trained language model, and employs a geometric graph transformer network to capture sequence and spatial information. To assess robustness, we first validated optimal temperature prediction in-house, achieving a Pearson correlation coefficient (PCC) of 0.78. The model was further confirmed on an independent test set, where GeoPoc surpassed the state-of-the-art method by 2.3% in area under the curve (AUC). GeoPoc was also extended to pH and salt concentration prediction, obtaining AUC scores of 0.78 and 0.77, respectively. Through further interpretability analysis, GeoPoc elucidates the critical physicochemical properties that contribute to enhanced protein thermostability.
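The abstract reports two evaluation metrics, PCC for regression and AUC for classification. As a minimal sketch of how such metrics are commonly computed (not the authors' evaluation code), the Python below implements both from their textbook definitions. The function names `pearson` and `roc_auc` and all numeric values are hypothetical toy choices for illustration; none come from the GeoPoc dataset or codebase.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a
    random negative, counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy values only (hypothetical, not from the paper).
true_temps = [37.0, 55.0, 70.0, 85.0]
pred_temps = [40.0, 50.0, 75.0, 80.0]
print(pearson(true_temps, pred_temps))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise AUC formulation is quadratic in the number of samples, which is fine for illustration; on a dataset of the size described above, a rank-based implementation from an established library would be the more practical choice.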