Zhejiang Lab, Zhejiang, China.
Dalian University of Technology, Liaoning, China.
BMC Bioinformatics. 2024 May 4;25(1):176. doi: 10.1186/s12859-024-05771-0.
Protein residue-residue distance maps are used for remote homology detection, protein information estimation, and protein structure research. However, existing prediction approaches are time-consuming, and with hundreds of millions of proteins discovered each year, a rapid and reliable method for predicting protein residue-residue distances is needed. Moreover, because many proteins lack known homologous sequences, such a method should be a waiting-free and alignment-free deep learning method.
In this study, we propose a learning framework named FreeProtMap. For protein representation processing, the group pooling proposed in FreeProtMap effectively mitigates the issues arising from high-dimensional sparseness in protein representations. For the model structure, we make several careful design choices. First, the model is designed around the locality of protein structures and triangle inequality constraints on distances to improve prediction accuracy. Second, inference speed is improved by using additive attention and a lightweight design. Third, generalization ability is improved by using bottlenecks and a neural network block named the local microformer. As a result, FreeProtMap can predict protein residue-residue distances in tens of milliseconds and achieves higher precision than the best structure prediction method.
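To make the triangle inequality constraint concrete, the following minimal sketch builds a distance map from residue coordinates and measures how strongly a given map violates d(i, j) <= d(i, k) + d(k, j). It is an illustration only: the abstract does not state how FreeProtMap applies the constraint inside the model, and the function names `distance_map` and `triangle_violation`, as well as the use of one 3D point per residue, are assumptions made here.

```python
import numpy as np

def distance_map(coords):
    """Pairwise Euclidean distances between residues.

    coords: array of shape (L, 3), one 3D point per residue (assumed here,
    e.g. one representative atom per residue).
    """
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def triangle_violation(dmap):
    """Largest amount by which d[i, j] exceeds d[i, k] + d[k, j] over all k.

    Returns 0.0 (up to rounding) for a map that satisfies the triangle
    inequality. Uses O(L^3) memory, so it is meant for small examples.
    """
    # via[i, k, j] = d[i, k] + d[k, j]
    via = dmap[:, :, None] + dmap[None, :, :]
    shortest_detour = via.min(axis=1)
    return float(np.maximum(dmap - shortest_detour, 0.0).max())

# Hypothetical usage with random coordinates standing in for a real protein.
rng = np.random.default_rng(0)
coords = rng.random((6, 3))
dmap = distance_map(coords)
print(triangle_violation(dmap))
```

A map derived from real coordinates always satisfies the inequality, so a check of this kind is mainly useful for spotting geometrically inconsistent predicted maps.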
Several groups of comparative and ablation experiments verify the effectiveness of these designs. The results demonstrate that FreeProtMap significantly outperforms other state-of-the-art methods in the accuracy of protein residue-residue distance prediction, which benefits a wide range of protein research. Notably, FreeProtMap could be used to scan all proteins discovered each year and find structurally similar proteins in a short time, because computing structural similarity from distance maps is much less time-consuming than algorithms based on 3D structures.
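As a rough illustration of why comparing distance maps is cheap, the sketch below computes a distance RMSD (dRMSD) between two maps. Unlike coordinate-based comparison, it needs no 3D superposition. The abstract does not specify which similarity measure the authors use, so dRMSD is a stand-in chosen here, and the sketch assumes two proteins of equal length with a one-to-one residue correspondence.

```python
import numpy as np

def drmsd(dmap_a, dmap_b):
    """Root-mean-square difference between two distance maps.

    Both maps must have shape (L, L) for the same L. Only the pairs i < j
    are compared, since the maps are symmetric with a zero diagonal.
    """
    if dmap_a.shape != dmap_b.shape:
        raise ValueError("distance maps must have the same shape")
    upper = np.triu_indices(dmap_a.shape[0], k=1)
    diff = dmap_a[upper] - dmap_b[upper]
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical 3-residue maps, in arbitrary distance units.
map_a = np.array([[0.0, 3.0, 4.0],
                  [3.0, 0.0, 5.0],
                  [4.0, 5.0, 0.0]])
map_b = np.array([[0.0, 4.0, 5.0],
                  [4.0, 0.0, 6.0],
                  [5.0, 6.0, 0.0]])
print(drmsd(map_a, map_b))
```

Because the score is a single pass over the map entries, it scales with the number of residue pairs and involves no iterative alignment of coordinates.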