Bush B L, Nachbar R B
Merck Research Laboratories, Merck & Co., Inc., Rahway, NJ 07065.
J Comput Aided Mol Des. 1993 Oct;7(5):587-619. doi: 10.1007/BF00124364.
Three-dimensional molecular modeling can provide an unlimited number m of structural properties. Comparative Molecular Field Analysis (CoMFA), for example, may calculate thousands of field values for each model structure. When m is large, partial least squares (PLS) is the statistical method of choice for fitting and predicting biological responses. Yet PLS is usually implemented in a property-based fashion which is optimal only for small m. We describe here a sample-based formulation of PLS which can be used to fit any single response (bioactivity). SAMPLS reduces all explanatory data to the pairwise 'distances' among n samples (molecules), or equivalently to an n-by-n covariance matrix C. This matrix, unmodified, can be used to fit all PLS components. Furthermore, SAMPLS will validate the model by modern resampling techniques, at a cost independent of m. We have implemented SAMPLS as a Fortran program and have reproduced conventional and cross-validated PLS analyses of data from two published studies. Full (leave-each-out) cross-validation of a typical CoMFA takes 0.2 CPU s. SAMPLS is thus ideally suited to structure-activity analysis based on CoMFA fields or bonded topology. The sample-distance formulation also relates PLS to methods like cluster analysis and nonlinear mapping, and shows how drastically PLS simplifies the information in CoMFA fields.
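The sample-based idea can be illustrated with a minimal Python sketch. It is not the authors' Fortran program: the function names (`centered_gram`, `sample_pls`, `loo_predictions`), the use of NumPy, and the random test data are all assumptions made for illustration. The sketch relies on the standard single-response PLS identity that each score vector equals the covariance matrix applied to the current response residual, orthogonalized against the earlier scores, so the m property columns are needed only once, to build the n-by-n matrix.

```python
import numpy as np

def centered_gram(X):
    """n-by-n cross-product matrix C of the column-centered data (hypothetical helper)."""
    Xc = X - X.mean(axis=0)
    return Xc @ Xc.T

def sample_pls(C, y, n_components, k_new=None):
    """Fit a single response from C alone.

    k_new (optional) holds the centered cross-products between one new sample
    and the training samples; if given, a prediction for that sample is returned.
    """
    y = np.asarray(y, dtype=float)
    y_mean = y.mean()
    r = y - y_mean                      # current response residual
    fitted = np.full(y.shape, y_mean)
    pred = y_mean if k_new is not None else None
    scores, new_scores = [], []
    for _ in range(n_components):
        t = C @ r                       # candidate score vector
        t_new = float(k_new @ r) if k_new is not None else 0.0
        # Orthogonalize against earlier scores; the new sample follows the
        # same linear combination.
        for s, s_new in zip(scores, new_scores):
            coef = (s @ t) / (s @ s)
            t = t - coef * s
            t_new -= coef * s_new
        tt = t @ t
        if tt == 0.0:                   # nothing left to fit
            break
        beta = (t @ r) / tt             # regress residual on the score
        fitted = fitted + beta * t
        r = r - beta * t
        if k_new is not None:
            pred += beta * t_new
        scores.append(t)
        new_scores.append(t_new)
    return fitted, pred

def loo_predictions(G, y, n_components):
    """Leave-one-out predictions from the uncentered cross-product matrix G = X X^T.

    Each fold re-centers a sub-block of G, so the cost does not depend on m.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        tr = np.delete(np.arange(n), i)           # training indices
        G_tt = G[np.ix_(tr, tr)]
        col_mean = G_tt.mean(axis=0)
        grand = G_tt.mean()
        C = G_tt - col_mean[None, :] - col_mean[:, None] + grand
        k = G[i, tr] - col_mean - G[i, tr].mean() + grand
        _, preds[i] = sample_pls(C, y[tr], n_components, k)
    return preds

# Illustrative random data: n = 10 samples, m = 40 properties.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 40))
y = rng.normal(size=10)
fitted, _ = sample_pls(centered_gram(X), y, n_components=2)
cv = loo_predictions(X @ X.T, y, n_components=2)
print(fitted.round(3))
print(cv.round(3))
```

In this sketch the matrix `G` is computed once from the property table. Every cross-validation fold then works only on n-by-n quantities, which is the source of the cost independence from m that the abstract describes.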