Wegner Jörg K, Zell Andreas
Zentrum für Bioinformatik Tübingen, Universität Tübingen, Sand 1, D-72076 Tübingen, Germany.
J Chem Inf Comput Sci. 2003 May-Jun;43(3):1077-84. doi: 10.1021/ci034006u.
The paper describes a fast and flexible descriptor selection method based on a genetic algorithm variant (GA-SEC). Descriptor relevance is measured using Shannon entropy (SE) and differential Shannon entropy (DSE), which have very low memory requirements and therefore allow huge data sets to be processed. A small number of the most important descriptors is then used automatically to build a value prediction model. The selected descriptors are not linear combinations of other descriptors but transparent, pure descriptors. We used an artificial neural network (ANN) model to predict the aqueous solubility logS and the octanol/water partition coefficient logP. The logS data set was divided into a training set of 1016 compounds and a test set of 253 compounds; a correlation coefficient of 0.93 and an empirical standard deviation of 0.54 were achieved. The logP data set was divided into a training set of 1853 compounds and a test set of 138 compounds; a correlation coefficient of 0.92 and an empirical standard deviation of 0.44 were achieved.
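The entropy measures named in the abstract can be illustrated with a minimal sketch. It assumes the usual histogram-based definitions, SE = -Σ p_i log2 p_i over the bins of a descriptor's value distribution, and DSE = SE_AB - (SE_A + SE_B)/2 for two compound sets A and B binned over a shared value range. The function names, the bin count, and the NumPy-based binning are illustrative assumptions, not details taken from the paper, and the GA-SEC selection step itself is not shown.

```python
import numpy as np


def shannon_entropy(values, bins=10, value_range=None):
    """Histogram-based Shannon entropy (in bits) of one descriptor.

    Assumption: the descriptor values are binned into equal-width bins;
    the bin count is an illustrative choice, not a value from the paper.
    """
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    total = counts.sum()
    if total == 0:
        return 0.0
    # Empty bins contribute nothing to the sum, so drop them first.
    p = counts[counts > 0] / total
    return float(np.sum(p * np.log2(1.0 / p)))


def differential_shannon_entropy(a, b, bins=10):
    """DSE = SE(A and B pooled) - (SE(A) + SE(B)) / 2 (assumed definition).

    Both sets are binned over the same value range so that their
    histograms are comparable.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    shared_range = (min(a.min(), b.min()), max(a.max(), b.max()))
    se_ab = shannon_entropy(np.concatenate([a, b]), bins, shared_range)
    se_a = shannon_entropy(a, bins, shared_range)
    se_b = shannon_entropy(b, bins, shared_range)
    return se_ab - 0.5 * (se_a + se_b)
```

Under these assumptions only the bin counts of each descriptor need to be kept, not the raw values, which is one plausible reading of the low memory requirements mentioned above. A descriptor whose values are spread evenly over the bins has high SE, and a descriptor whose distributions differ strongly between two compound sets has high DSE.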