Xie Jun-Zhong, Zhou Xu-Yuan, Luan Dong, Jiang Hong
Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China.
J Chem Theory Comput. 2022 Jun 14;18(6):3795-3804. doi: 10.1021/acs.jctc.2c00017. Epub 2022 Jun 3.
Cluster expansion (CE) is a powerful theoretical tool for studying the configuration-dependent properties of substitutionally disordered systems. Typically, a CE model is built by fitting a few tens or hundreds of target quantities calculated by first-principles approaches. To validate the reliability of the model, a convergence test of the cross-validation (CV) score with respect to the training set size is commonly conducted to verify the sufficiency of the training data. However, such a test only confirms the convergence of the predictive capability of the CE model within the training set, and it is unknown whether convergence of the CV score leads to robust thermodynamic simulation results such as the order-disorder phase transition temperature. In this work, using carbon-defective MoC as a model system and aided by the machine-learning force field technique, a training data pool of about 13000 configurations was efficiently generated and used to draw different random training sets of the same size. By conducting parallel Monte Carlo simulations with CE models trained on the different randomly selected training sets, the uncertainty in the calculated transition temperature can be evaluated at different training set sizes. It is found that a training set size sufficient for the CV score to converge still leads to a significant uncertainty in the predicted transition temperature, and that this uncertainty can be considerably reduced by enlarging the training set to a few thousand configurations. This work highlights the importance of using a large training set to build an optimal CE model that achieves robust statistical modeling results, and the efficiency with which the machine-learning force field approach can produce adequate training data.
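The uncertainty-quantification protocol described in the abstract can be illustrated with a minimal sketch: repeatedly draw random training sets of a fixed size from a large pool, refit the effective cluster interactions (ECIs) each time, and measure the spread of a derived prediction across the refits. The sketch below uses synthetic linear data as a stand-in for real CE correlation features and DFT energies, and a fixed probe prediction as a stand-in for the Monte Carlo transition temperature; all names, sizes, and noise levels are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a cluster expansion: "energies" are a linear function of
# cluster correlation features plus noise (all values here are hypothetical;
# the pool size loosely mirrors the paper's ~13000-configuration pool).
n_pool, n_feat = 13000, 30
X = rng.standard_normal((n_pool, n_feat))      # correlation matrix of the pool
eci_true = rng.standard_normal(n_feat)         # "true" effective cluster interactions
E = X @ eci_true + 0.05 * rng.standard_normal(n_pool)  # noisy target energies

# A fixed probe configuration: its predicted energy stands in for the
# thermodynamic quantity (e.g., a transition temperature) derived from the model.
x_probe = rng.standard_normal(n_feat)

def predicted_quantity(train_idx):
    """Fit ECIs by least squares on one training set, predict the probe quantity."""
    eci, *_ = np.linalg.lstsq(X[train_idx], E[train_idx], rcond=None)
    return x_probe @ eci

def spread(n_train, n_replicas=50):
    """Std. dev. of the prediction over independently drawn training sets of
    the same size -- the uncertainty measure used in the paper's protocol."""
    preds = [predicted_quantity(rng.choice(n_pool, n_train, replace=False))
             for _ in range(n_replicas)]
    return float(np.std(preds))

# The spread shrinks as the training set grows, even after a CV-style
# in-sample error has essentially converged.
for n in (100, 400, 1600, 6400):
    print(f"n_train = {n:5d}   prediction spread = {spread(n):.4f}")
```

In this toy setting the replica-to-replica spread decays roughly as 1/sqrt(n_train), which mirrors the paper's observation that a training set large enough for CV convergence can still leave substantial uncertainty in downstream predictions.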