Teixeira Ana L, Falcao Andre O
LaSIGE, Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal.
J Chem Inf Model. 2014 Jul 28;54(7):1833-49. doi: 10.1021/ci500110v. Epub 2014 Jun 25.
Structurally similar molecules tend to have similar properties; that is, molecules that lie closer together in the molecular space are more likely to yield similar property values, while distant molecules are more likely to yield different values. Based on this principle, we propose a new method that takes into account the high dimensionality of the molecular space and predicts chemical, physical, or biological properties from the most similar compounds with measured properties. The methodology couples ordinary kriging with three different molecular similarity approaches (based on molecular descriptors, fingerprints, and atom matching), creating an interpolation map over the molecular space that is capable of predicting properties/activities for diverse chemical data sets. The proposed method was tested on two data sets of diverse chemical compounds collected from the literature and preprocessed. One data set contained dihydrofolate reductase inhibition activity data, and the second contained molecules with known aqueous solubility. The overall predictive results using kriging for both data sets are consistent with those reported in the literature for typical QSPR/QSAR approaches. However, the procedure did not involve any type of descriptor selection or even minimal information about each problem, suggesting that this approach is directly applicable to a large spectrum of problems in QSAR/QSPR. Furthermore, the predictive results improve significantly as the similarity threshold between the training and testing compounds increases, allowing the definition of a confidence threshold of similarity and an error estimate for each inferred case. The use of kriging for interpolation over the molecular metric space is independent of the training data set size: no reparametrization is necessary when compounds are added to or removed from the set, and increasing the size of the database will consequently improve the quality of the estimations. Finally, it is shown that this model can be used for checking the consistency of measured data and for guiding an extension of the training set by determining the regions of the molecular space for which new experimental measurements could be used to maximize the model's predictive performance.
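To make the interpolation idea concrete, the following is a minimal sketch of ordinary kriging over a fingerprint-based distance, written in plain Python. It is not the authors' implementation. Several choices are assumptions made only for illustration: fingerprints are represented as sets of on-bit indices, the distance is one minus the Tanimoto similarity, the variogram is an exponential model with arbitrary sill and range values, and the function names (tanimoto_distance, variogram, solve, krige) are hypothetical. The abstract does not state which variogram or parameters were used.

```python
import math


def tanimoto_distance(a, b):
    """Return 1 - Tanimoto similarity for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def variogram(d, sill=1.0, rng=0.5):
    """Exponential variogram (assumed model): 0 at d = 0, approaching sill as d grows."""
    return sill * (1.0 - math.exp(-d / rng))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular kriging system (duplicate fingerprints?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def krige(train_fps, train_values, query_fp):
    """Ordinary kriging estimate for one query compound.

    Returns (estimate, kriging_variance, weights).
    """
    n = len(train_fps)
    # Variogram between all training pairs, bordered by the constraint
    # that the weights sum to one (the Lagrange multiplier row/column).
    lhs = [
        [variogram(tanimoto_distance(train_fps[i], train_fps[j])) for j in range(n)]
        + [1.0]
        for i in range(n)
    ]
    lhs.append([1.0] * n + [0.0])
    # Variogram between each training compound and the query.
    rhs = [variogram(tanimoto_distance(fp, query_fp)) for fp in train_fps] + [1.0]
    sol = solve(lhs, rhs)
    weights, mu = sol[:n], sol[n]
    estimate = sum(w * z for w, z in zip(weights, train_values))
    variance = sum(w * g for w, g in zip(weights, rhs[:n])) + mu
    return estimate, variance, weights


if __name__ == "__main__":
    # Toy data: three made-up fingerprints with made-up property values.
    fps = [{1, 2, 3}, {2, 3, 4}, {7, 8, 9}]
    values = [1.0, 2.0, 5.0]
    est, var, w = krige(fps, values, {2, 3, 4, 5})
    print("estimate:", est, "variance:", var, "weights:", w)
```

In this sketch the kriging variance grows as the query moves away from the training compounds, which is one plausible way to read the per-case error estimation mentioned in the abstract. Adding a compound only adds a row and column to the system, so the assumed variogram does not need to be refitted.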