Kaspi Omer, Yosipof Abraham, Senderowitz Hanoch
Department of Systems Engineering, Afeka - Tel-Aviv Academic College of Engineering, Tel-Aviv, Israel.
Department of Chemistry, Bar-Ilan University, 5290002, Ramat-Gan, Israel.
J Cheminform. 2017 Jun 6;9(1):34. doi: 10.1186/s13321-017-0224-0.
An important aspect of chemoinformatics and material-informatics is the use of machine learning algorithms to build Quantitative Structure Activity Relationship (QSAR) models. The RANdom SAmple Consensus (RANSAC) algorithm is a predictive modeling tool widely used in the image processing field for removing noise from datasets. RANSAC could be used as a "one stop shop" algorithm for developing and validating QSAR models: it performs outlier removal, descriptor selection, model development, and predictions for test set samples using an applicability domain. For "future" predictions (i.e., for samples not included in the original test set), RANSAC provides a statistical estimate of the probability of obtaining reliable predictions, i.e., predictions within a pre-defined number of standard deviations from the true values. In this work we describe the first application of RANSAC in material informatics, focusing on the analysis of solar cells. We demonstrate that, for three datasets representing different metal oxide (MO) based solar cell libraries, RANSAC-derived models select descriptors previously shown to correlate with key photovoltaic (PV) properties and lead to good predictive statistics for these properties. These models were subsequently used to predict the properties of virtual solar cell libraries, highlighting interesting dependencies of PV properties on MO compositions.
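To make the outlier-removal idea concrete, the following is a minimal sketch of the generic RANSAC loop applied to a one-descriptor linear model. It is an illustration under stated assumptions, not the workflow used in the paper: the function names (`fit_line`, `ransac_line`), the parameters (`n_trials`, `threshold`, `seed`), and the toy data are all hypothetical, and descriptor selection and the applicability domain are not reproduced here.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None  # degenerate sample: all x values identical
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ransac_line(points, n_trials=200, threshold=1.0, seed=0):
    """Generic RANSAC loop (hypothetical helper, illustrative parameters).

    Repeatedly fits a line to a random pair of points, counts the points
    whose residual is within `threshold`, and keeps the largest such
    consensus set. The final model is refit on that set only.
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        model = fit_line(rng.sample(points, 2))
        if model is None:
            continue
        slope, intercept = model
        inliers = [
            (x, y) for x, y in points
            if abs(y - (slope * x + intercept)) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if not best_inliers:
        raise ValueError("no non-degenerate sample was drawn")
    return fit_line(best_inliers), best_inliers


# Toy data (assumed for illustration): y = 2x + 1 with small noise,
# plus three gross outliers standing in for faulty measurements.
clean = [(float(i), 2.0 * i + 1.0 + 0.1 * ((i % 3) - 1)) for i in range(20)]
outliers = [(3.0, 40.0), (10.0, -30.0), (15.0, 90.0)]
points = clean + outliers

(slope, intercept), inliers = ransac_line(points)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, kept {len(inliers)} of {len(points)}")
```

In this sketch the residual threshold plays the role of the "pre-defined number of standard deviations": a sample counts as consistent with the model only if its prediction error falls inside that band, and samples outside it are treated as outliers and excluded from the final fit.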