Laboratory of Informatics and Data Mining (LIDM), Department of Computer and Information Science, Fordham University, 113 West 60th Street, New York, New York 10023, United States.
Molecular Foundry, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, United States.
J Chem Inf Model. 2021 Apr 26;61(4):1593-1602. doi: 10.1021/acs.jcim.0c01307. Epub 2021 Apr 2.
Combinatorial fusion analysis (CFA) is an approach for combining multiple scoring systems using the rank-score characteristic function and a cognitive diversity measure. One application is combining diverse machine learning models to achieve better prediction quality. In this work, we apply CFA to the synthesis of metal halide perovskites containing organic ammonium cations via inverse temperature crystallization. Using a data set generated by high-throughput experimentation, we developed four individual models (support vector machines, random forests, weighted logistic classifier, and gradient boosted trees). We characterize each of these scoring systems and explore 66 possible combinations of the models. When measured by precision in predicting crystal formation, the majority of the combination models improve on the individual model results. The best combination models outperform the best individual models by 3.9 percentage points in precision. In addition to improving prediction quality, we demonstrate how the fusion models can be used to identify mislabeled input data and address issues of data quality. In particular, we identify example cases where all single models and all fusion models fail to give the correct prediction. Experimental replication of these syntheses reveals that these compositions are sensitive to modest temperature variations across different locations on the heating element, which can hinder or enhance the crystallization process. In summary, we demonstrate that model fusion using CFA can not only identify a previously unconsidered influence on reaction outcome but also serve as a form of quality control for high-throughput experimentation.
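The abstract names the two ingredients of CFA without defining them, so a small sketch may help. It assumes a commonly used formulation: the rank-score characteristic function maps rank i to the normalized score of the item at that rank, and cognitive diversity is the root-mean-square difference between two such functions. These definitions, the function names, and the toy scores are illustrative assumptions, not the authors' implementation. The average-score and average-rank combinations shown are likewise only two simple options, and the abstract does not specify how the 66 combinations were formed.

```python
from math import sqrt

def normalize(scores):
    """Min-max scale scores to [0, 1] (all zeros if the scores are constant)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def ranks(scores):
    """Rank 1 is the highest score; ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for position, item in enumerate(order, start=1):
        r[item] = position
    return r

def rsc(scores):
    """Rank-score characteristic: normalized score at rank 1, 2, ..., n."""
    return sorted(normalize(scores), reverse=True)

def cognitive_diversity(scores_a, scores_b):
    """RMS difference between two RSC functions (assumed form of the measure).

    Both systems must score the same number of items.
    """
    fa, fb = rsc(scores_a), rsc(scores_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa))

def score_combination(*systems):
    """Average of normalized scores per item; higher is better."""
    normed = [normalize(s) for s in systems]
    return [sum(col) / len(col) for col in zip(*normed)]

def rank_combination(*systems):
    """Average of ranks per item; lower is better."""
    ranked = [ranks(s) for s in systems]
    return [sum(col) / len(col) for col in zip(*ranked)]

# Hypothetical scores from two models over the same five candidate reactions.
model_a = [4.0, 0.0, 2.0, 3.0, 1.0]
model_b = [0.0, 8.0, 1.0, 2.0, 4.0]

print("RSC of A:", rsc(model_a))
print("RSC of B:", rsc(model_b))
print("cognitive diversity:", cognitive_diversity(model_a, model_b))
print("score combination:", score_combination(model_a, model_b))
print("rank combination:", rank_combination(model_a, model_b))
```

Because the rank-score characteristic discards which item holds each rank, two models can disagree on every item and still have zero diversity under this measure. It compares how the models distribute their scores, not which items they favor.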