Stahura F L, Godden J W, Xue L, Bajorath J
Computer-Aided Drug Discovery, New Chemical Entities, Inc., Bothell, Washington 98011, USA.
J Chem Inf Comput Sci. 2000 Sep-Oct;40(5):1245-52. doi: 10.1021/ci0003303.
Molecular descriptors were identified by Shannon entropy analysis that correctly distinguished, in binary QSAR calculations, between naturally occurring molecules and synthetic compounds. The Shannon entropy concept was first used in digital communication theory and has only very recently been applied to descriptor analysis. Binary QSAR methodology was originally developed to correlate structural features and properties of compounds with a binary formulation of biological activity (i.e., active or inactive) and has here been adapted to correlate molecular features with chemical source (i.e., natural or synthetic). We have identified a number of molecular descriptors with significantly different Shannon entropy and/or "entropic separation" in natural and synthetic compound databases. Different combinations of such descriptors and variably distributed structural keys were applied to learning sets consisting of natural and synthetic molecules and used to derive predictive binary QSAR models. These models were then applied to predict the source of compounds in different test sets consisting of randomly collected natural and synthetic molecules, or, alternatively, sets of natural and synthetic molecules with specific biological activities. On average, greater than 80% prediction accuracy was achieved with our best models. For the test case consisting of molecules with specific activities, greater than 90% accuracy was achieved. From our analysis, some chemical features were identified that systematically differ in many naturally occurring versus synthetic molecules.
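The Shannon entropy comparison described in the abstract can be illustrated with a minimal sketch. It assumes that a descriptor's values are binned into a histogram and that the entropy is computed as H = -sum(p_i * log2(p_i)) over the occupied bins, with the same bin range used for both databases so the two values are comparable. The function name, the bin count, and the toy descriptor values are hypothetical and are not taken from the paper. "Entropic separation" is not defined in the abstract, so it is not computed here.

```python
import math
from collections import Counter


def shannon_entropy(values, lo, hi, bins=10):
    """Shannon entropy (in bits) of descriptor values binned over [lo, hi].

    Illustrative helper, not the authors' implementation. Values outside
    [lo, hi] are clamped into the first or last bin.
    """
    if not values:
        raise ValueError("values must not be empty")
    if hi <= lo or bins < 1:
        raise ValueError("need hi > lo and bins >= 1")

    width = (hi - lo) / bins
    counts = Counter()
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)  # clamp to a valid bin index
        counts[idx] += 1

    n = len(values)
    # Empty bins contribute nothing, so only occupied bins are summed.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy descriptor values (made up for illustration only).
natural = [0.5, 1.5, 2.5, 3.5, 0.7, 1.2, 2.9, 3.1]
synthetic = [0.5, 0.6, 0.7, 0.8, 0.9, 0.4, 0.3, 3.5]

# Use one shared range and bin count for both sets.
h_nat = shannon_entropy(natural, lo=0.0, hi=4.0, bins=4)
h_syn = shannon_entropy(synthetic, lo=0.0, hi=4.0, bins=4)
print(f"natural: {h_nat:.3f} bits, synthetic: {h_syn:.3f} bits")
```

A descriptor whose values spread evenly over the bins has high entropy, while one concentrated in a few bins has low entropy. Under the assumptions above, a large entropy difference between the two databases would flag a descriptor as a candidate for the kind of source classification the abstract describes.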