Goscinski Alexander, Musil Félix, Pozdnyakov Sergey, Nigam Jigyasa, Ceriotti Michele
Laboratory of Computational Science and Modeling, Institute of Materials, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland.
J Chem Phys. 2021 Sep 14;155(10):104106. doi: 10.1063/5.0057229.
The input of almost every machine learning algorithm targeting the properties of matter at the atomic scale involves a transformation of the list of Cartesian atomic coordinates into a more symmetric representation. Many of the most popular representations can be seen as an expansion of the symmetrized correlations of the atom density and differ mainly by the choice of basis. Considerable effort has been dedicated to the optimization of the basis set, typically driven by heuristic considerations on the behavior of the regression target. Here, we take a different, unsupervised viewpoint, aiming to determine the basis that encodes in the most compact way possible the structural information that is relevant for the dataset at hand. For each training dataset and number of basis functions, one can build a unique basis that is optimal in this sense and can be computed at no additional cost with respect to the primitive basis by approximating it with splines. We demonstrate that this construction yields representations that are accurate and computationally efficient, particularly when working with representations that correspond to high body-order correlations. We present examples that involve both molecular and condensed-phase machine-learning models.
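The abstract does not spell out how the data-driven basis is built. One plausible reading of "most compact" is a principal-component contraction of the expansion coefficients computed on a larger primitive basis. The sketch below illustrates that idea on synthetic data. It makes several assumptions that are not taken from the abstract: the primitive basis is orthonormal, the covariance is left uncentered, and the names `optimal_contraction`, `coeffs`, and `n_keep` are hypothetical.

```python
import numpy as np


def optimal_contraction(coeffs, n_keep):
    """Return a contraction matrix built from the leading principal directions.

    coeffs : array of shape (n_samples, n_primitive), expansion coefficients
             on an (assumed orthonormal) primitive basis.
    n_keep : number of contracted basis functions to retain.
    """
    # Uncentered covariance of the coefficients over the dataset (assumption).
    cov = coeffs.T @ coeffs / coeffs.shape[0]
    # eigh returns eigenvalues in ascending order, so reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_keep]], eigvals[order]


# Synthetic stand-in for real density coefficients: 12 primitive functions
# whose variation is dominated by 3 hidden directions plus small noise.
rng = np.random.default_rng(0)
n_samples, n_primitive, n_latent = 500, 12, 3
latent = rng.normal(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_primitive))
coeffs = latent @ mixing + 1e-3 * rng.normal(size=(n_samples, n_primitive))

U, spectrum = optimal_contraction(coeffs, n_keep=3)

# Project onto the contracted basis, then map back to measure what was lost.
contracted = coeffs @ U
reconstructed = contracted @ U.T
rel_error = np.linalg.norm(coeffs - reconstructed) / np.linalg.norm(coeffs)
print(f"kept {U.shape[1]} of {n_primitive} functions, relative error {rel_error:.2e}")
```

Because each contracted function is a fixed linear combination of the primitive ones, it could in principle be tabulated once on a radial grid and interpolated with splines, which is consistent with the abstract's statement that the optimal basis adds no evaluation cost over the primitive basis.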