Chen Dong, Liu Jian, Wei Guo-Wei
Department of Mathematics, Michigan State University, East Lansing, MI, USA.
Mathematical Science Research Center, Chongqing University of Technology, Chongqing, China.
Nat Mach Intell. 2024 Jul;6(7):799-810. doi: 10.1038/s42256-024-00855-1. Epub 2024 Jun 21.
Despite the success of pretrained natural language processing (NLP) models in various fields, their application in computational biology has been hindered by their reliance on biological sequences, which ignore vital three-dimensional (3D) structural information that is incompatible with the sequential architecture of NLP models. Here we present the topological transformer (TopoFormer), built by integrating NLP models with a multiscale topology technique, the persistent topological hyperdigraph Laplacian (PTHL). PTHL systematically converts intricate 3D protein-ligand complexes into NLP-compatible sequences of topological invariants and homotopic shapes, capturing essential interactions across spatial scales. TopoFormer achieves exemplary scoring accuracy and excellent performance in ranking, docking and screening tasks on several benchmark datasets. This approach can be used to convert general high-dimensional structured data into NLP-compatible sequences, paving the way for broader NLP-based research.
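The core idea, converting a 3D structure into an NLP-compatible sequence of multiscale topological invariants, can be illustrated with a minimal sketch. This is not the paper's PTHL method; it uses the simplest topological invariant (Betti-0, the number of connected components) of a distance filtration over a toy point cloud, evaluated at increasing radii to produce a scale-ordered sequence that a sequence model could consume.

```python
import numpy as np

def betti0_sequence(points, radii):
    """Count connected components (Betti-0) of the distance graph
    at each filtration radius, yielding a sequence of topological
    invariants ordered by spatial scale."""
    n = len(points)
    # Pairwise Euclidean distances between all points
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    def components(r):
        # Union-find over edges no longer than the current radius r
        parent = list(range(n))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i
        for i in range(n):
            for j in range(i + 1, n):
                if d[i, j] <= r:
                    ri, rj = find(i), find(j)
                    if ri != rj:
                        parent[ri] = rj
        return len({find(i) for i in range(n)})

    return [components(r) for r in radii]

# Toy "complex": two tight clusters of atoms on the x-axis
pts = np.array([[0, 0, 0], [0.5, 0, 0], [5, 0, 0], [5.5, 0, 0]], dtype=float)
seq = betti0_sequence(pts, radii=[0.1, 1.0, 10.0])
print(seq)  # [4, 2, 1]: components merge as the scale grows
```

The resulting sequence [4, 2, 1] encodes how structure coalesces across scales; PTHL extends this idea far beyond Betti-0, using persistent Laplacians on hyperdigraphs to capture both topological invariants and homotopic shape information.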