Vogt Martin, Bajorath Jürgen
Department of Life Science Informatics, B-IT, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany.
J Chem Inf Model. 2008 Feb;48(2):247-55. doi: 10.1021/ci700333t. Epub 2008 Jan 30.
We investigate an approach that combines Bayesian modeling of the probability distributions of descriptor values for active and database molecules with Kullback-Leibler analysis of the divergence between these distributions. The methodology is used for Bayesian screening and also to predict compound recall rates. In our study, we analyze two fundamental approximations underlying the Bayesian screening approach: the assumption that descriptors are independent of each other and the assumption that their data set values follow normal distributions. In addition, we calculate Kullback-Leibler divergence for single descriptors, rather than for multiple-feature distributions, in order to prioritize descriptors for screening calculations. The results show that descriptor correlation effects, which violate the assumption of feature independence, can lead to a notable reduction of compound recall in Bayesian screening. Controlling descriptor correlation effects plays a much more significant role in achieving high recall rates than approximating descriptor distributions by Gaussians. Furthermore, Kullback-Leibler divergence analysis is shown to systematically identify the descriptors that are most relevant for the outcome of Bayesian screening calculations.
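The two ingredients named in the abstract can be illustrated with a minimal sketch. It assumes that each descriptor is modeled by one Gaussian fitted to the active molecules and one fitted to the database molecules, and that descriptors are treated as independent. Under those assumptions the single-descriptor Kullback-Leibler divergence has the closed form ln(s_d/s_a) + (s_a^2 + (m_a - m_d)^2) / (2 s_d^2) - 1/2, and a screening score can be written as a sum of per-descriptor log-likelihood ratios. The function names (`gaussian_kl`, `rank_descriptors`, `log_odds`), the descriptor names, and the toy values are hypothetical and are not taken from the paper; the abstract does not state the authors' exact estimators, so this is not their implementation.

```python
import math
from statistics import fmean, pstdev


def fit(values):
    """Fit a Gaussian to a list of descriptor values: (mean, population std dev)."""
    return fmean(values), pstdev(values)


def gaussian_kl(mu_a, sd_a, mu_d, sd_d):
    """Closed-form KL divergence D(N(mu_a, sd_a^2) || N(mu_d, sd_d^2)) in nats."""
    return (math.log(sd_d / sd_a)
            + (sd_a ** 2 + (mu_a - mu_d) ** 2) / (2 * sd_d ** 2)
            - 0.5)


def rank_descriptors(actives, database):
    """Rank descriptors by single-descriptor KL divergence, largest first.

    actives, database: dict mapping descriptor name -> list of values.
    """
    scores = []
    for name in actives:
        mu_a, sd_a = fit(actives[name])
        mu_d, sd_d = fit(database[name])
        scores.append((name, gaussian_kl(mu_a, sd_a, mu_d, sd_d)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


def log_odds(molecule, actives, database, names):
    """Sum of per-descriptor Gaussian log-likelihood ratios (active vs. database).

    Summing over descriptors is exactly the independence assumption; correlated
    descriptors are counted more than once, which is the effect the abstract
    reports as reducing recall.
    """
    total = 0.0
    for name in names:
        mu_a, sd_a = fit(actives[name])
        mu_d, sd_d = fit(database[name])
        x = molecule[name]
        total += (math.log(sd_d / sd_a)
                  - (x - mu_a) ** 2 / (2 * sd_a ** 2)
                  + (x - mu_d) ** 2 / (2 * sd_d ** 2))
    return total


# Toy values invented for illustration only (not data from the paper).
toy_actives = {"logP": [3.0, 3.5, 4.0, 4.5], "MW": [300.0, 320.0, 310.0, 330.0]}
toy_database = {"logP": [1.0, 2.0, 3.0, 4.0], "MW": [250.0, 350.0, 300.0, 400.0]}

ranking = rank_descriptors(toy_actives, toy_database)
top_names = [name for name, _ in ranking]
score = log_odds({"logP": 3.75, "MW": 315.0}, toy_actives, toy_database, top_names)
```

Restricting `names` to the highest-ranked descriptors mirrors the idea of prioritizing descriptors by divergence before screening; how many to keep, and how to prune correlated ones, is left open here.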