Zhong Jiancheng, Zou Zhiwei, Qiu Jie, Wang Shaokai
College of Information Science and Engineering, Hunan Normal University, 36 Lushan Road, Yuelu District, Changsha 410081, Hunan, China.
Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China.
Brief Bioinform. 2025 Mar 4;26(2). doi: 10.1093/bib/bbaf156.
In protein design, efficiently constructing protein sequences that fold accurately into predefined structures has become an important area of research. Although progress has been made on long-chain proteins, the design of short-chain proteins deserves equal attention. The structural information available in short and single chains is typically less comprehensive than that of full-length chains, which can degrade design performance on them. To address this challenge, we introduce ScFold, a novel model that incorporates an innovative node module. This module uses spatial dimensionality reduction and positional encoding mechanisms to enhance the extraction of structural features. Experimental results indicate that ScFold achieves a recovery rate of 52.22% on the CATH4.2 dataset and is notably effective on short-chain proteins, where it reaches a recovery rate of 41.6%. ScFold also achieves higher recovery rates of 59.32% and 61.59% on the TS50 and TS500 datasets, respectively, demonstrating its effectiveness across diverse protein types. In addition, we stratified the TS500 and CATH4.2 datasets by protein length and tested ScFold on the length-specific sub-datasets. The results confirm the model's superiority in handling short-chain proteins. Finally, we selected several groups of protein sequences from the CATH4.2 dataset for structural visualization analysis and compared the model-generated sequences with the target sequences.
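The recovery rate and the length stratification mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of sequence recovery (the fraction of positions at which the designed residue matches the native one) and is not taken from the ScFold code. The function names, the bin edges (100 and 300 residues), and the use of the mean within each bin are illustrative assumptions; the abstract does not state the length thresholds or the averaging convention, and some benchmarks report the median instead.

```python
from collections import defaultdict


def recovery_rate(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


def length_bin(length: int, edges=(100, 300)) -> str:
    """Label the length bin for a sequence; the edges are hypothetical."""
    lower = 0
    for edge in edges:
        if length < edge:
            return f"{lower}-{edge - 1}"
        lower = edge
    return f"{lower}+"


def stratified_recovery(pairs, edges=(100, 300)) -> dict:
    """Mean recovery per length bin for (native, designed) sequence pairs."""
    by_bin = defaultdict(list)
    for native, designed in pairs:
        label = length_bin(len(native), edges)
        by_bin[label].append(recovery_rate(native, designed))
    # Assumption: average within each bin with the arithmetic mean.
    return {label: sum(vals) / len(vals) for label, vals in by_bin.items()}
```

With such a helper, a short-chain figure like the one reported above would correspond to the value of the lowest length bin, under whatever threshold the authors actually used.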