Lin Thy-Hou, Li Huang-Te, Tsai Keng-Chang
Institute of Molecular Medicine & Department of Life Science, National Tsing Hua University, Hsinchu, Taiwan 30013, ROC.
J Chem Inf Comput Sci. 2004 Jan-Feb;44(1):76-87. doi: 10.1021/ci030295a.
Fisher's discriminant ratio is used as a class separability criterion and implemented in a k-means clustering algorithm to perform simultaneous feature selection and data set trimming on a set of 221 HIV-1 protease inhibitors. A total of 43 molecular descriptors are computed for each inhibitor and scaled to lie between 0 and 1 before the feature selection process. Because the goal is to select the most class-sensitive descriptors, several feature evaluation indices are used to filter them: the Shannon entropy, linear regression of selected descriptors on the pKi of selected inhibitors, and a stepwise variable selection program. While the Shannon entropy gives the information content of each computed descriptor, the linear regression and stepwise variable selection procedures are used to search for more class-sensitive descriptors. The inhibitors are divided into several different numbers of classes, and the five-class division is adopted because it gives the best feature selection result. Most of the good features selected are topological descriptors, and they correlate well with the pKi values. Outliers, that is, inhibitors whose values for each selected descriptor are less class sensitive, are identified and gathered by the k-means clustering algorithm. These are the trimmed inhibitors, while the remaining ones are retained (selected). We find that 98 inhibitors (44%) can be retained when three good descriptors are selected for clustering. The descriptor values of these selected inhibitors are far more class sensitive than the original ones, as evidenced by a substantial increase in statistical significance when they are subjected to both SYBYL CoMFA PLS and Cerius2 PLS regression analyses.
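The abstract names three computations without giving formulas: scaling descriptors to the unit interval, Fisher's discriminant ratio as a class separability score, and the Shannon entropy of a descriptor. The following is a minimal sketch of one common form of each, not the authors' implementation. The function names, the between-class over within-class form of the ratio, and the 10-bin histogram used for the entropy are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def min_max_scale(values):
    """Scale one descriptor column linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant descriptor carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fisher_ratio(values, labels):
    """Between-class scatter divided by within-class scatter for one descriptor.

    Assumed form (the abstract does not state the exact formula):
        sum_k n_k (mean_k - mean)^2 / sum_k sum_{i in k} (x_i - mean_k)^2
    Larger values mean the descriptor separates the classes better.
    """
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    grand_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        between += len(members) * (mean - grand_mean) ** 2
        within += sum((v - mean) ** 2 for v in members)
    return math.inf if within == 0 else between / within


def shannon_entropy(scaled_values, bins=10):
    """Entropy in bits of a [0, 1]-scaled descriptor, estimated from a histogram.

    The bin count is a hypothetical choice, not taken from the paper.
    """
    counts = Counter(min(int(v * bins), bins - 1) for v in scaled_values)
    n = len(scaled_values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Toy data: two classes of three "inhibitors" each (made-up numbers).
    labels = [0, 0, 0, 1, 1, 1]
    separating = min_max_scale([1.0, 1.2, 0.9, 5.0, 5.1, 4.8])
    mixed = min_max_scale([1.0, 5.0, 3.0, 1.1, 4.9, 3.1])
    print("Fisher ratio, separating descriptor:", fisher_ratio(separating, labels))
    print("Fisher ratio, mixed descriptor:", fisher_ratio(mixed, labels))
    print("Entropy, separating descriptor:", shannon_entropy(separating))
```

In this toy run the descriptor whose values track the class labels scores a much higher ratio than the one whose values are spread across both classes, which is the sense in which a descriptor is "class sensitive" here. A trimming step in the spirit of the abstract would then cluster the inhibitors on the top-ranked descriptors and set aside those that lower the ratio, but the abstract does not specify that procedure, so it is not sketched.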