McGibbon Robert T, Pande Vijay S
Department of Chemistry, Stanford University, Stanford, California 94305, USA.
J Chem Phys. 2015 Mar 28;142(12):124105. doi: 10.1063/1.4916292.
Markov state models are a widely used method for approximating the eigenspectrum of the molecular dynamics propagator, yielding insight into the long-timescale statistical kinetics and slow dynamical modes of biomolecular systems. However, the lack of a unified theoretical framework for choosing between alternative models has hampered progress, especially for non-experts applying these methods to novel biological systems. Here, we consider cross-validation with a new objective function for estimators of these slow dynamical modes, a generalized matrix Rayleigh quotient (GMRQ), which measures the ability of a rank-m projection operator to capture the slow subspace of the system. It is shown that a variational theorem bounds the GMRQ from above by the sum of the first m eigenvalues of the system's propagator, but that this bound can be violated when the requisite matrix elements are estimated subject to statistical uncertainty. This overfitting can be detected and avoided through cross-validation. These results make it possible to construct Markov state models for protein dynamics in a way that appropriately captures the tradeoff between systematic and statistical errors.
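The cross-validation idea described above can be illustrated with a small sketch. It is not the authors' implementation: the function names (`simulate`, `correlation_matrices`, `fit_modes`, `gmrq`), the toy 3-state chain, and the simple symmetrization of the count matrix are all assumptions made for illustration. The sketch assumes the GMRQ of a set of m trial modes A takes the form tr((AᵀSA)⁻¹ AᵀCA), where C is a time-lagged correlation matrix and S is the overlap matrix. The modes are fitted on one trajectory and scored on a second, held-out trajectory.

```python
import numpy as np
from scipy.linalg import eigh


def simulate(T, n_steps, rng):
    """Sample a discrete trajectory from transition matrix T (toy data, assumed)."""
    states = np.empty(n_steps, dtype=int)
    states[0] = 0
    for t in range(1, n_steps):
        states[t] = rng.choice(len(T), p=T[states[t - 1]])
    return states


def correlation_matrices(dtraj, n_states, lag=1):
    """Return the time-lagged correlation matrix C and overlap matrix S.

    Counts are symmetrized to impose reversibility; this is a simplifying
    assumption of the sketch, not necessarily the estimator used in the paper.
    """
    counts = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1.0
    counts = 0.5 * (counts + counts.T)
    C = counts / counts.sum()
    S = np.diag(C.sum(axis=1))  # diagonal matrix of state populations
    return C, S


def fit_modes(C, S, m):
    """Top-m solutions of the generalized eigenproblem C a = lambda S a.

    Requires S to be positive definite, i.e. every state must be visited.
    """
    vals, vecs = eigh(C, S)
    order = np.argsort(vals)[::-1][:m]
    return vecs[:, order]


def gmrq(A, C, S):
    """Generalized matrix Rayleigh quotient tr((A^T S A)^-1 A^T C A)."""
    return np.trace(np.linalg.solve(A.T @ S @ A, A.T @ C @ A))


# Hypothetical 3-state chain used only to generate example data.
T_true = np.array([[0.90, 0.05, 0.05],
                   [0.05, 0.90, 0.05],
                   [0.05, 0.05, 0.90]])
rng = np.random.default_rng(0)
train = simulate(T_true, 2000, rng)
test = simulate(T_true, 2000, rng)

C_train, S_train = correlation_matrices(train, n_states=3)
C_test, S_test = correlation_matrices(test, n_states=3)

m = 2
A = fit_modes(C_train, S_train, m)
train_score = gmrq(A, C_train, S_train)  # sum of the top-m estimated eigenvalues
test_score = gmrq(A, C_test, S_test)     # held-out score used for model selection
print(f"train GMRQ = {train_score:.4f}, test GMRQ = {test_score:.4f}")
```

In this setting the training score is optimistic by construction, because the modes were chosen to maximize it on the same data. Comparing held-out scores across candidate models (for example, different numbers of states) is the kind of selection the abstract describes.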