EaStCHEM School of Chemistry, David Brewster Road, Joseph Black Building, The King's Buildings, Edinburgh EH9 3FJ, United Kingdom.
Redesign Science, 180 Varick St., New York, New York 10014, United States.
J Chem Theory Comput. 2024 Jan 23;20(2):977-988. doi: 10.1021/acs.jctc.3c01134. Epub 2024 Jan 1.
Markov state models (MSMs) are a popular statistical method for analyzing the conformational dynamics of proteins, including protein folding. As with all statistical and machine learning (ML) models, choices must be made about the modeling pipeline that cannot be directly learned from the data. These choices, or hyperparameters, are often evaluated by expert judgment or, in the case of MSMs, by maximizing variational scores such as the VAMP-2 score. Modern ML and statistical pipelines often use automatic hyperparameter selection techniques ranging from the simple (choosing the best score from a random selection of hyperparameters) to the complex (e.g., Bayesian optimization). In this work, we ask whether it is possible to select MSMs automatically in this way by estimating and analyzing over 16,000,000 observations from over 280,000 estimated MSMs. We find that differences in hyperparameters can change the physical interpretation of the optimization objective, making automatic selection difficult. In addition, we find that enforcing conditions of equilibrium in the VAMP scores can result in inconsistent model selection. However, other parameters that specify the VAMP-2 score (lag time and number of relaxation processes scored) have only a negligible influence on model selection. We suggest that model observables and variational scores should be only a guide to model selection and that a full investigation of the MSM properties should be undertaken when selecting hyperparameters.
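To make the "simple" end of that range concrete, the following is a minimal sketch of random hyperparameter search scored by VAMP-2, not the pipeline used in this work. It rests on several assumptions: the data are a toy one-dimensional trajectory, the only hyperparameter is the number of equal-width bins used to discretize it, the search range is arbitrary, and the score is computed on the training data without cross-validation. The names `vamp2_score` and `random_search` are hypothetical. The score is taken as the sum of the squared top-k singular values of the whitened time-lagged correlation matrix of the discrete states.

```python
import numpy as np


def vamp2_score(dtraj, n_states, lag, k):
    """Sum of the squared top-k singular values of the whitened
    time-lagged correlation matrix of a discrete trajectory."""
    x, y = dtraj[:-lag], dtraj[lag:]
    # Time-lagged correlation matrix of state indicator functions.
    c01 = np.zeros((n_states, n_states))
    np.add.at(c01, (x, y), 1.0)
    c01 /= len(x)
    p0 = c01.sum(axis=1)  # state populations at time t
    p1 = c01.sum(axis=0)  # state populations at time t + lag
    # Inverse square roots, leaving unvisited states at zero.
    w0 = np.zeros_like(p0)
    w1 = np.zeros_like(p1)
    w0[p0 > 0] = p0[p0 > 0] ** -0.5
    w1[p1 > 0] = p1[p1 > 0] ** -0.5
    kmat = w0[:, None] * c01 * w1[None, :]
    s = np.linalg.svd(kmat, compute_uv=False)  # sorted in descending order
    return float(np.sum(s[:k] ** 2))


def random_search(traj, lag, k, n_trials, rng):
    """Score randomly drawn bin counts and return the best one."""
    results = []
    for _ in range(n_trials):
        n_bins = int(rng.integers(2, 51))  # assumed search range: 2 to 50 bins
        edges = np.linspace(traj.min(), traj.max(), n_bins + 1)
        dtraj = np.digitize(traj, edges[1:-1])  # state indices 0 .. n_bins - 1
        results.append((vamp2_score(dtraj, n_bins, lag, k), n_bins))
    return max(results), results


# Toy data (assumption): a slowly relaxing AR(1) process.
rng = np.random.default_rng(0)
traj = np.zeros(5000)
for t in range(1, len(traj)):
    traj[t] = 0.95 * traj[t - 1] + rng.normal()

(best_score, best_n_bins), all_results = random_search(
    traj, lag=10, k=2, n_trials=20, rng=rng
)
print(f"best VAMP-2 score {best_score:.3f} with {best_n_bins} bins")
```

A search of this kind simply returns whichever trial scores highest, so it is subject to the same caveat as any variational score: the result is a guide to model selection and not a substitute for inspecting the properties of the resulting MSM.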