Vriza Aikaterini, Canaj Angelos B, Vismara Rebecca, Kershaw Cook Laurence J, Manning Troy D, Gaultois Michael W, Wood Peter A, Kurlin Vitaliy, Berry Neil, Dyer Matthew S, Rosseinsky Matthew J
Department of Chemistry and Materials Innovation Factory, University of Liverpool, 51 Oxford Street, Liverpool L7 3NY, UK
Leverhulme Research Centre for Functional Materials Design, University of Liverpool, Oxford Street, Liverpool L7 3NY, UK
Chem Sci. 2020 Dec 8;12(5):1702-1719. doi: 10.1039/d0sc04263c.
Machine learning models have brought major changes to decision-making in materials design. One concern for data-driven approaches is the lack of negative data from unsuccessful synthetic attempts, which can produce inherently imbalanced datasets. We propose one-class classification as an effective tool for tackling this limitation in materials design problems. One-class classification learns from a single well-defined class, without counterexamples. We perform an extensive study of different one-class classification algorithms to identify the most appropriate workflow for guiding the discovery of emerging materials belonging to a relatively small class, namely weakly bound polyaromatic hydrocarbon co-crystals. The two-step approach presented in this study first trains the model on all known molecular combinations that form this class of co-crystals, extracted from the Cambridge Structural Database (1722 molecular combinations), and then scores possible yet unknown pairs from the ZINC15 database (21 736 possible molecular combinations). By focusing on the highest-ranking pairs, those predicted to have a higher probability of forming co-crystals, materials discovery can be accelerated by narrowing the vast molecular space and directing the synthetic efforts of chemists. In addition, interpretability techniques are used to seek a more detailed understanding of the molecular properties that drive co-crystallization. The applicability of the methodology is demonstrated by the discovery of two novel co-crystals, namely pyrene-6H-benzo[c]chromen-6-one (1) and pyrene-9,10-dicyanoanthracene (2).
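The two-step idea (fit on positive examples only, then score and rank unlabeled candidates) can be illustrated with a minimal sketch. This is not the workflow used in the study: it swaps in a simple k-nearest-neighbour distance score as a stand-in one-class scorer. The two-dimensional descriptors and the names `knn_score`, `rank_candidates`, `known_pairs`, and `pair_A` to `pair_C` are hypothetical and chosen only for illustration.

```python
import math

def knn_score(train, x, k=3):
    """Score x by the negative mean distance to its k nearest training points.

    Higher scores mean x lies closer to the known (positive-only) class.
    """
    dists = sorted(math.dist(x, t) for t in train)
    return -sum(dists[:k]) / k

def rank_candidates(train, candidates, k=3):
    """Return (name, score) tuples sorted from most to least class-like."""
    scored = [(name, knn_score(train, vec, k)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Step 1: "training" data. Hypothetical 2-D descriptors standing in for
# known co-crystal pairs; only positive examples are available.
known_pairs = [(x / 2, y / 2) for x in range(-2, 3) for y in range(-2, 3)]

# Step 2: score unlabeled candidate pairs (hypothetical descriptors).
candidates = {
    "pair_A": (0.1, -0.2),  # inside the region covered by known pairs
    "pair_B": (4.0, 4.0),   # far from every known pair
    "pair_C": (2.5, 2.0),   # moderately far from the known pairs
}

ranking = rank_candidates(known_pairs, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setting the candidate closest to the known examples ranks first. A real application would replace the toy descriptors with molecular descriptors for each pair and the distance score with a validated one-class model.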