Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae146.
Protein sequence design can provide valuable insights into biopharmaceuticals and disease treatments. Currently, most deep learning-based protein sequence design methods focus on optimizing the network architecture while ignoring protein-specific physicochemical features. Inspired by the successful application of structure templates and pre-trained models in protein structure prediction, we explored whether a structural sequence profile representation can be used for protein sequence design. In this work, we propose SPDesign, a method for protein sequence design based on a structural sequence profile, using ultrafast shape recognition. Given an input backbone structure, SPDesign uses ultrafast shape recognition vectors to accelerate the search for similar protein structures in our in-house PAcluster80 structure database and then extracts the sequence profile through structure alignment. The profile, combined with structural pre-trained knowledge and geometric features, is then fed into an enhanced graph neural network for sequence prediction. The results show that SPDesign significantly outperforms state-of-the-art methods such as ProteinMPNN, Pifold and LM-Design, with gains in sequence recovery rate of 21.89%, 15.54% and 11.4%, respectively, on the CATH 4.2 benchmark. Encouraging results have also been achieved on orphan and de novo (designed) benchmarks with few homologous sequences. Furthermore, analysis with the PDBench tool suggests that SPDesign performs well on subdivided structures. More interestingly, we found that SPDesign can accurately reconstruct the sequences of some proteins that have similar structures but different sequences. Finally, a structural modeling verification experiment indicates that sequences designed by SPDesign fold into the native structures more accurately.
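To illustrate why ultrafast shape recognition (USR) vectors make a structure search fast, the sketch below computes the classic 12-number USR descriptor from a set of 3D coordinates and compares two descriptors with the usual inverse Manhattan-distance score. It is a minimal sketch under stated assumptions, not SPDesign's implementation: the abstract does not specify which USR variant is used, the function names `usr_descriptor` and `usr_similarity` are hypothetical, and the coordinates are assumed to be something like the C-alpha atoms of a backbone. The spread and asymmetry terms are rescaled to distance units, one common variant of the three moments.

```python
import numpy as np


def usr_descriptor(coords):
    """Return a 12-dimensional USR-style descriptor for an (N, 3) array.

    Assumption: classic USR reference points, with three moment-based
    statistics of the distance distribution to each point.
    """
    coords = np.asarray(coords, dtype=float)

    # Four reference points: centroid, the atom closest to the centroid,
    # the atom farthest from the centroid, and the atom farthest from that one.
    ctd = coords.mean(axis=0)
    d_ctd = np.linalg.norm(coords - ctd, axis=1)
    cst = coords[d_ctd.argmin()]
    fct = coords[d_ctd.argmax()]
    d_fct = np.linalg.norm(coords - fct, axis=1)
    ftf = coords[d_fct.argmax()]

    descriptor = []
    for ref in (ctd, cst, fct, ftf):
        d = np.linalg.norm(coords - ref, axis=1)
        mean = d.mean()
        centered = d - mean
        spread = np.sqrt((centered ** 2).mean())   # square root of 2nd central moment
        skew = np.cbrt((centered ** 3).mean())     # cube root of 3rd central moment
        descriptor.extend([mean, spread, skew])
    return np.array(descriptor)


def usr_similarity(desc_a, desc_b):
    """Score in (0, 1]; 1.0 means identical descriptors."""
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    return 1.0 / (1.0 + np.abs(desc_a - desc_b).mean())
```

Because the descriptor is built only from interatomic distances, it is unchanged by rotating or translating the structure, so candidate structures can be ranked by comparing short fixed-length vectors before any structure alignment is run on the top hits.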