LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal.
J Cheminform. 2013 Feb 11;5(1):9. doi: 10.1186/1758-2946-5-9.
A central task in developing quantitative structure-property relationship (QSPR) predictive models is identifying the subset of variables that represent the structure of a molecule and are predictive of a given property. Several automated feature selection methods exist, ranging from backward, forward, or stepwise procedures to more elaborate methodologies such as evolutionary programming. The problem lies in selecting the minimum subset of descriptors that predicts a given property accurately, robustly, and at low computational cost, since irrelevant or redundant features can impair generalization. This paper proposes an alternative selection method for QSPR regression problems that uses Random Forests to determine variable importance, and applies it to a manually curated dataset for predicting standard enthalpy of formation. The subsequent predictive models are trained with support vector machines, introducing the variables sequentially from a list ranked by variable importance.
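The selection scheme described above (rank variables with a Random Forest, then train support vector machines on the top-k variables for increasing k) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn, uses impurity-based `feature_importances_` as a stand-in for whatever importance measure the paper uses, runs on synthetic data, and the function names `rank_by_rf_importance` and `sequential_svm_selection` and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def rank_by_rf_importance(X, y, seed=0):
    """Return feature indices sorted from most to least important.

    Assumption: impurity-based importance from scikit-learn; the paper's
    exact importance measure may differ.
    """
    rf = RandomForestRegressor(n_estimators=200, random_state=seed)
    rf.fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1]


def sequential_svm_selection(X, y, ranking, cv=10):
    """Add features in ranked order; keep the k with the lowest CV RMSE."""
    best_k, best_rmse = 0, np.inf
    for k in range(1, len(ranking) + 1):
        # Scale inputs before the SVM; hyperparameters are illustrative only.
        model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
        scores = cross_val_score(
            model, X[:, ranking[:k]], y,
            cv=cv, scoring="neg_root_mean_squared_error",
        )
        rmse = -scores.mean()
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return best_k, best_rmse


# Synthetic stand-in for a descriptor matrix: only features 0 and 1 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 12))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=120)

ranking = rank_by_rf_importance(X, y)
best_k, best_rmse = sequential_svm_selection(X, y, ranking)
print("ranked features:", ranking.tolist())
print("selected subset size:", best_k, "CV RMSE:", round(float(best_rmse), 3))
```

On real descriptor data the loop over k would typically be capped or stepped, since refitting an SVM for each of 1485 candidate subset sizes is costly.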
The model generalizes well even with a high-dimensional dataset and in the presence of highly correlated variables. Feature selection yielded lower prediction errors, with RMSE values 23% lower than without feature selection, while using only 6% of the total number of variables (89 of the original 1485). The proposed approach also compared favourably with other feature selection methods and with dimension reduction of the feature space. The predictive model was selected using a 10-fold cross-validation procedure and then validated on an independent set to assess its performance on new data; the results were similar to those obtained for the training set, supporting the robustness of the proposed approach.
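To make the reported figures concrete, the short sketch below shows how an RMSE, a relative RMSE reduction, and the retained fraction of variables are computed. Only the counts 89 and 1485 come from the text; the baseline and reduced RMSE values (10.0 and 7.7) are hypothetical numbers chosen solely to illustrate what a 23% reduction means, and the helper names are assumptions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def relative_reduction(baseline, reduced):
    """Fractional decrease of `reduced` relative to `baseline`."""
    return (baseline - reduced) / baseline


# Fraction of descriptors retained, using the counts stated in the abstract.
fraction_kept = 89 / 1485
print(f"descriptors kept: {fraction_kept:.1%}")

# Hypothetical RMSE values, for illustration only (not results from the paper).
baseline_rmse, selected_rmse = 10.0, 7.7
print(f"RMSE reduction: {relative_reduction(baseline_rmse, selected_rmse):.0%}")
```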
The proposed methodology appears to improve the prediction of the standard enthalpy of formation of hydrocarbons using a limited set of molecular descriptors. Reducing the number of descriptors makes their calculation faster and more cost-effective, and gives a better understanding of the underlying relationship between the molecular structure represented by the descriptors and the property of interest.