Feng Minggao, Zhao Chengxi, Day Graeme M, Evangelopoulos Xenophon, Cooper Andrew I
Materials Innovation Factory and Department of Chemistry, University of Liverpool, Liverpool, UK
School of Chemistry and Chemical Engineering, University of Southampton, Southampton, UK
Chem Sci. 2025 May 21. doi: 10.1039/d5sc00677e.
The physical and chemical properties of molecular crystals are a combined function of molecular structure and the molecular crystal packing. Specific crystal packings can enable applications such as pharmaceuticals, organic electronics, and porous materials for gas storage. However, to design such materials, we need to predict both crystal structure and the resulting physical properties, and this is expensive using traditional computational methods. Machine-learned interatomic potential methods offer major accelerations here, but molecular crystal structure prediction remains challenging due to the weak intermolecular interactions that dictate crystal packing. Moreover, machine-learned interatomic potentials do not accelerate the prediction of all physical properties for molecular crystals. Here we present Molecular Crystal Representation from Transformers (MCRT), a transformer-based model for molecular crystal property prediction that is pre-trained on 706 126 experimental crystal structures extracted from the Cambridge Structural Database (CSD). MCRT employs four different pre-training tasks to extract both local and global representations from the crystals using multi-modal features to encode crystal structure and geometry. MCRT has the potential to serve as a universal foundation model for predicting a range of properties for molecular crystals, achieving state-of-the-art results even when fine-tuned on small-scale datasets. We demonstrate MCRT's practical utility in both crystal property prediction and crystal structure prediction. We also show that model predictions can be interpreted by using attention scores.