Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, United States.
Department of Computer Science, Tufts University, 177 College Avenue, Medford, MA 02155, United States.
Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad663.
High-quality computational structural models are now precomputed and available for nearly every protein in UniProt. However, it is not immediately clear how best to leverage these models to predict, in a high-throughput manner, which pairs of proteins interact. The recent Foldseek method of van Kempen et al. encodes the structural information of distances and angles along the protein backbone into a linear string of the same length as the amino acid sequence, using tokens from a 21-letter discretized structural alphabet (3Di).
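To illustrate the length-preserving property described above, the following minimal sketch turns a structural string into one indicator vector per residue. The 21-symbol set used here (20 letters plus a placeholder "X"), the toy sequences, and the function name `one_hot` are assumptions made for illustration only; they are not taken from Foldseek.

```python
# Hypothetical 21-symbol alphabet: 20 letters plus "X" as a placeholder.
# The actual symbol set emitted by Foldseek may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(tokens: str) -> list[list[int]]:
    """Return one 21-dimensional indicator vector per position."""
    rows = []
    for ch in tokens:
        row = [0] * len(ALPHABET)
        row[INDEX.get(ch, INDEX["X"])] = 1  # unknown symbols map to "X"
        rows.append(row)
    return rows


aa_seq = "MKTAYIAK"  # toy amino acid sequence (made up)
di_seq = "DVVLLVCD"  # toy structural string of the same length (made up)

# One structural token per residue, so the two strings align position by position.
assert len(aa_seq) == len(di_seq)
encoded = one_hot(di_seq)
```

Because each residue has exactly one structural token, the encoded matrix has one row per residue and can be aligned with any per-residue sequence representation.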
We show that using both the amino acid sequence and the 3Di sequence generated by Foldseek as inputs to our recent deep-learning method, Topsy-Turvy, substantially improves cross-species prediction of protein-protein interactions. Thus TT3D (Topsy-Turvy 3D) offers a way to reuse the computational effort that goes into producing high-quality structural models from sequence, while remaining lightweight enough that high-quality binary protein-protein interaction predictions can be made for all protein pairs genome-wide.
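One plausible way to feed both inputs to a model is to join them position by position. The sketch below does this for toy data. How TT3D actually combines the two inputs is not stated in this abstract, so the concatenation, the function name `concat_features`, and the feature dimensions are all assumptions.

```python
def concat_features(seq_feats, struct_feats):
    """Join two per-residue feature matrices position by position.

    Assumption: both inputs hold one row per residue, in the same order.
    """
    if len(seq_feats) != len(struct_feats):
        raise ValueError("inputs must have one row per residue")
    return [list(s) + list(t) for s, t in zip(seq_feats, struct_feats)]


# Toy data for a two-residue protein (values are made up).
seq_feats = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # 3-dim sequence features
struct_feats = [[1, 0], [0, 1]]                 # 2-dim structural indicators

combined = concat_features(seq_feats, struct_feats)
```

Here each residue ends up with a single feature vector whose first part comes from the amino acid sequence and whose second part comes from the structural string.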
TT3D is available at https://github.com/samsledje/D-SCRIPT. An archived version of the code at time of submission can be found at https://zenodo.org/records/10037674.