On-the-fly selection of a training set for aqueous solubility prediction.

Zhang Hongzhou, Ando Howard Y, Chen Linna, Lee Pil H

Research Formulations and Computer-Assisted Drug Discovery, Pfizer Global Research & Development, Michigan Laboratories, 2800 Plymouth Road, Ann Arbor, Michigan 48105, USA.

Mol Pharm. 2007 Jul-Aug;4(4):489-97. doi: 10.1021/mp0700155. Epub 2007 Jul 12.

Training sets are usually chosen so that they represent the database as a whole; random selection helps to maintain this integrity. In this study, the prediction of aqueous solubility was used as a specific example of using the individual molecule for which solubility is desired, the target molecule, as the basis for choosing a training set. Similarity of the training set to the target molecule, rather than random allocation, was used as the selection criterion. Tanimoto coefficients derived from Daylight's binary fingerprints were used as the molecular similarity selection tool. Prediction models derived from this type of customization are designated "on-the-fly local" models because a new, necessarily local model is generated for each target molecule. Such models are compared with "global" models, which are derived from a one-time "preprocessed" partitioning into training and test sets and which use fixed fitted parameters for every target molecule prediction. Although both fragment and molecular descriptors were examined, a minimum set of MOE (Molecular Operating Environment) molecular descriptors was found to be more efficient and was used for both the on-the-fly local and the preprocessed global models. On-the-fly local predictions were more accurate (r² = 0.87) than the preprocessed global predictions (r² = 0.74) for the same test set. In addition, their precision was shown to increase as the degree of similarity increased. Correlation and distribution plots were used to visualize similarity cutoff groupings and their chemical structures. In summary, rapid "on-the-fly" similarity selection can enable the customization of a training set to each target molecule for which solubility is desired. In addition, the similarity information and the model's fitting statistics give the user criteria to judge the validity of the prediction, since it is always possible that a good prediction cannot be obtained because the database and the target molecule are too dissimilar. Although the rapid processing speed of binary fingerprints enables "on-the-fly" real-time prediction, slower but more feature-rich similarity measures may improve follow-up predictions.
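
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: fingerprints are represented as plain Python sets of "on" bit indices rather than Daylight fingerprints, a single hypothetical descriptor stands in for the MOE descriptor set, and ordinary least squares stands in for whatever fitting procedure the paper used (the abstract does not name it). The function names, the 0.5 cutoff, and the `max_size` limit are all illustrative.

```python
from typing import Dict, List, Optional, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient for fingerprints stored as sets of 'on' bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_training_set(
    target_fp: Set[int],
    database: Dict[str, Set[int]],
    cutoff: float = 0.5,   # illustrative similarity cutoff, not from the paper
    max_size: int = 50,    # illustrative cap on training set size
) -> List[Tuple[float, str]]:
    """Return (similarity, name) pairs at or above the cutoff, most similar first."""
    scored = [(tanimoto(target_fp, fp), name) for name, fp in database.items()]
    selected = [pair for pair in scored if pair[0] >= cutoff]
    selected.sort(key=lambda pair: (-pair[0], pair[1]))
    return selected[:max_size]


def predict_local(
    target_fp: Set[int],
    target_x: float,
    database: Dict[str, Set[int]],
    descriptor: Dict[str, float],
    log_s: Dict[str, float],
    cutoff: float = 0.5,
) -> Optional[float]:
    """Fit a one-descriptor least-squares model on the neighbours of the target.

    Returns None when the database is too dissimilar to the target to
    support a fit (fewer than two neighbours, or no spread in the descriptor).
    """
    neighbours = [name for _, name in select_training_set(target_fp, database, cutoff)]
    if len(neighbours) < 2:
        return None
    xs = [descriptor[name] for name in neighbours]
    ys = [log_s[name] for name in neighbours]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * target_x


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    db = {"mol_a": {1, 2, 3, 4}, "mol_b": {1, 2, 3, 5}, "mol_c": {1, 2, 6, 7}}
    desc = {"mol_a": 1.0, "mol_b": 2.0, "mol_c": 5.0}
    logs = {"mol_a": -2.0, "mol_b": -3.0, "mol_c": 0.0}
    target = {1, 2, 3, 4, 5}
    print(select_training_set(target, db))
    print(predict_local(target, 1.5, db, desc, logs))
```

Because a new model is fitted for every target, the list of similarities returned by `select_training_set` doubles as a rough validity check: a `None` prediction, or a neighbour list with low similarities, signals that the database may be too dissimilar to the target for the prediction to be trusted.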
