Cao Siqin, Nüske Feliks, Liu Bojun, Soley Micheline B, Huang Xuhui
Department of Chemistry, Theoretical Chemistry Institute, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States.
Max-Planck-Institute for Dynamics of Complex Technical Systems, Magdeburg 39106, Germany.
J Chem Theory Comput. 2025 May 13;21(9):4855-4866. doi: 10.1021/acs.jctc.5c00076. Epub 2025 Apr 20.
Elucidating collective variables (CVs) for biomolecular dynamics is crucial for understanding numerous biological processes. By leveraging the tensor-train data structure, a multilinear version of the AMUSE (Algorithm for Multiple Unknown Signals) algorithm for Koopman approximation (AMUSEt) was recently developed to identify CVs for biomolecular dynamics. To find slow CVs, AMUSEt transforms input features (e.g., pairwise atomic distances) into nonlinear basis functions (e.g., Gaussian functions) and encodes these nonlinear basis functions within a tensor-train structure via time-lagged correlation functions. Due to the need to fit these tensor-train data structures into computer memory, AMUSEt can handle only a limited number of input features. Consequently, AMUSEt relies on manually selecting and ranking features based on physical intuition to fully capture the slow dynamics. However, when applied to complex biological systems with numerous features, this selection and ranking process becomes increasingly challenging. To address this challenge, here we present AMUSET-TICA (AMUSEt-based Time-lagged Independent Component Analysis), a CV-identification method using time-structure-independent components (tICs) as the input features for AMUSEt. The key insight of AMUSET-TICA lies in its highly effective embedding of high-dimensional atomistic protein conformations, achieved by expanding orthogonal tICs into overlapping Gaussian basis functions through a tensor-product data structure. This eliminates the need for manually selecting and ranking input features for a wide range of biomolecular systems. We demonstrate that AMUSET-TICA consistently and significantly outperforms AMUSEt and tICA in identifying slow CVs for three different biomolecular systems: alanine dipeptide, the N-terminal domain of L9 (NTL9), and the FIP35 WW domain. 
For all these systems, the CVs generated by AMUSET-TICA accurately describe the slowest dynamical modes underlying these biological conformational changes. Furthermore, we show that AMUSET-TICA achieves performance comparable to deep-learning approaches such as VAMPnets in identifying the slowest dynamical modes, while being significantly more computationally efficient in terms of CPU time. In addition, the CVs yielded by AMUSET-TICA provide insights into the folding mechanisms of NTL9 and the FIP35 WW domain, including CV3 and CV4 of the WW domain, which capture its two parallel folding pathways. We expect that AMUSET-TICA can be widely applied to facilitate the investigation of biomolecular dynamics.
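The pipeline described in the abstract (linear tICA on the raw features, expansion of the leading tICs into overlapping Gaussian basis functions, then a second Koopman-style estimate on the product basis) can be illustrated with a small NumPy sketch. This is not the authors' implementation: it forms the tensor-product basis explicitly, which is only feasible for a handful of tICs and Gaussians, whereas AMUSEt keeps that product in tensor-train format. The function names (`tica_amuse`, `product_basis`), the toy data, the lag time, and the choice of Gaussian centers and widths are all assumptions made for illustration.

```python
import numpy as np

def tica_amuse(X, lag, n_components):
    """Linear tICA in the AMUSE style: whiten the data, then diagonalize
    the symmetrized time-lagged covariance in the whitened space."""
    X = X - X.mean(axis=0)
    X0, Xt = X[:-lag], X[lag:]
    n = X0.shape[0]
    # Symmetrized estimators (assumes reversible, stationary dynamics).
    C0 = (X0.T @ X0 + Xt.T @ Xt) / (2 * n)
    Ct = (X0.T @ Xt + Xt.T @ X0) / (2 * n)
    # Whitening transform; near-singular directions are discarded.
    s, U = np.linalg.eigh(C0)
    keep = s > 1e-10 * s.max()
    W = U[:, keep] / np.sqrt(s[keep])
    evals, V = np.linalg.eigh(W.T @ Ct @ W)
    order = np.argsort(evals)[::-1][:n_components]
    return X @ (W @ V[:, order]), evals[order]

def product_basis(tics, n_gauss):
    """Expand each tIC into overlapping Gaussians and form the explicit
    tensor product over tICs (hypothetical helper; the tensor-train
    format in AMUSEt avoids building this n_gauss**d matrix)."""
    n_frames, d = tics.shape
    Phi = np.ones((n_frames, 1))
    for k in range(d):
        y = tics[:, k]
        centers = np.linspace(y.min(), y.max(), n_gauss)
        width = centers[1] - centers[0]  # assumed: width equals center spacing
        G = np.exp(-0.5 * ((y[:, None] - centers[None, :]) / width) ** 2)
        Phi = np.einsum("ti,tj->tij", Phi, G).reshape(n_frames, -1)
    return Phi

# Toy data (an assumption, not from the paper): one slow and three fast
# autoregressive processes, linearly mixed into four observed features.
rng = np.random.default_rng(0)
n_frames = 20000
coeffs = np.array([0.99, 0.5, 0.3, 0.1])
noise = rng.standard_normal((n_frames, 4))
sources = np.zeros((n_frames, 4))
for t in range(1, n_frames):
    sources[t] = coeffs * sources[t - 1] + noise[t]
features = sources @ rng.standard_normal((4, 4))

# Step 1: linear tICA on the raw features.
tics, tic_evals = tica_amuse(features, lag=10, n_components=2)
# Step 2: Gaussian expansion of the tICs in a product basis.
Phi = product_basis(tics, n_gauss=3)
# Step 3: slow CVs from the nonlinear basis.
cvs, cv_evals = tica_amuse(Phi, lag=10, n_components=2)
```

In this sketch the input to the nonlinear step is already ordered by slowness, which is the role tICs play in AMUSET-TICA in place of manually ranked features. The explicit product basis grows as `n_gauss**d`, which is the memory limitation the tensor-train representation is meant to address.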