Implementing the Fisher's discriminant ratio in a k-means clustering algorithm for feature selection and data set trimming.

Authors

Lin Thy-Hou, Li Huang-Te, Tsai Keng-Chang

Affiliation

Institute of Molecular Medicine & Department of Life Science, National Tsing Hua University, Hsinchu, Taiwan 30013, ROC.

Publication information

J Chem Inf Comput Sci. 2004 Jan-Feb;44(1):76-87. doi: 10.1021/ci030295a.

DOI:10.1021/ci030295a

PMID:14741013

Abstract

Fisher's discriminant ratio has been used as a class separability criterion and implemented in a k-means clustering algorithm to perform simultaneous feature selection and data set trimming on a set of 221 HIV-1 protease inhibitors. A total of 43 molecular descriptors are computed for each inhibitor, and they are scaled to lie between 0 and 1 before being subjected to the feature selection process. Since the purpose is to select some of the most class-sensitive descriptors, several feature evaluation indices, namely the Shannon entropy, the linear regression of selected descriptors on the pKi of selected inhibitors, and a stepwise variable selection program, are used to filter them. While the Shannon entropy provides the information content of each descriptor computed, more class-sensitive descriptors are searched for by both the linear regression and stepwise variable selection procedures. Several different numbers of classes are tried for dividing the inhibitors; five classes are adopted because this division gives the best feature selection result. Most of the good features selected are topological descriptors, and they correlate well with the pKi values. The outliers, that is, the inhibitors whose values for each selected descriptor are less class-sensitive, are identified and gathered by the k-means clustering algorithm. These are the trimmed inhibitors, while the remaining ones are retained or selected. We find that 98 inhibitors (44%) can be retained when three good descriptors are selected for clustering. The descriptor values of these selected inhibitors are far more class-sensitive than the original ones, as evidenced by a substantial increase in statistical significance when they are subjected to both the SYBYL CoMFA PLS and Cerius2 PLS regression analyses.
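The abstract names three building blocks (scaling descriptors to the 0 to 1 range, a Fisher-type class separability score, and the Shannon entropy of each descriptor) without giving formulas. The sketch below illustrates one common form of each, and it is not the authors' implementation. The Fisher ratio is taken here as between-class scatter divided by within-class scatter, computed per descriptor; the entropy is estimated from a 10-bin histogram. All function names, the bin count, and the synthetic data are assumptions made for illustration, and the coupling with k-means described in the paper is not reproduced.

```python
import numpy as np

def min_max_scale(X):
    """Scale each descriptor column to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (X - lo) / span

def fisher_ratio(X, labels):
    """Per-descriptor ratio of between-class to within-class scatter (assumed form)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        between += len(Xc) * (class_mean - grand_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)  # guard against zero within-class scatter

def shannon_entropy(column, bins=10):
    """Shannon entropy in bits of a scaled descriptor, from a histogram on [0, 1]."""
    counts, _ = np.histogram(column, bins=bins, range=(0.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Synthetic stand-in data: 5 classes of 20 samples, one informative and one noise descriptor.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 20)
informative = labels + rng.normal(0.0, 0.1, size=labels.size)
noise = rng.normal(size=labels.size)
X = min_max_scale(np.column_stack([informative, noise]))

scores = fisher_ratio(X, labels)
entropies = [shannon_entropy(X[:, j]) for j in range(X.shape[1])]
print("Fisher ratios:", scores)
print("Entropies (bits):", entropies)
```

On this synthetic data the informative descriptor receives a much larger ratio than the noise descriptor, which is the kind of ranking a class separability criterion is meant to produce.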

Similar articles

Implementing the Fisher's discriminant ratio in a k-means clustering algorithm for feature selection and data set trimming.

J Chem Inf Comput Sci. 2004 Jan-Feb;44(1):76-87. doi: 10.1021/ci030295a.

Quantitative structure-activity relationship modeling of juvenile hormone mimetic compounds for Culex pipiens larvae, with a discussion of descriptor-thinning methods.

J Chem Inf Model. 2006 Jan-Feb;46(1):65-77. doi: 10.1021/ci050215y.

Exploring molecular shape analysis of styrylquinoline derivatives as HIV-1 integrase inhibitors.

Eur J Med Chem. 2008 Jan;43(1):81-92. doi: 10.1016/j.ejmech.2007.02.021. Epub 2007 Mar 14.

Predictive QSAR modeling of HIV reverse transcriptase inhibitor TIBO derivatives.

Eur J Med Chem. 2009 Apr;44(4):1509-24. doi: 10.1016/j.ejmech.2008.07.020. Epub 2008 Jul 24.

Feature selection for descriptor based classification models. 2. Human intestinal absorption (HIA).

J Chem Inf Comput Sci. 2004 May-Jun;44(3):931-9. doi: 10.1021/ci034233w.

An improved approximation to the estimation of the critical F values in best subset regression.

J Chem Inf Model. 2007 Jan-Feb;47(1):143-9. doi: 10.1021/ci060113n.

Simultaneous feature selection and clustering using mixture models.

IEEE Trans Pattern Anal Mach Intell. 2004 Sep;26(9):1154-66. doi: 10.1109/TPAMI.2004.71.

QSTR with extended topochemical atom (ETA) indices. 12. QSAR for the toxicity of diverse aromatic compounds to Tetrahymena pyriformis using chemometric tools.

Chemosphere. 2009 Nov;77(7):999-1009. doi: 10.1016/j.chemosphere.2009.07.072. Epub 2009 Aug 25.

What should be expected from feature selection in small-sample settings.

Bioinformatics. 2006 Oct 1;22(19):2430-6. doi: 10.1093/bioinformatics/btl407. Epub 2006 Jul 26.

Supervised feature ranking using a genetic algorithm optimized artificial neural network.

J Chem Inf Model. 2006 Jul-Aug;46(4):1604-14. doi: 10.1021/ci0600354.

Cited by

OPTIMAL: An OPTimized Imaging Mass cytometry AnaLysis framework for benchmarking segmentation and data exploration.

Cytometry A. 2024 Jan;105(1):36-53. doi: 10.1002/cyto.a.24803. Epub 2023 Oct 5.

Identification of functional gene modules by integrating multi-omics data and known molecular interactions.

Front Genet. 2023 Jan 24;14:1082032. doi: 10.3389/fgene.2023.1082032. eCollection 2023.

Machine learning in the prediction of cancer therapy.

Comput Struct Biotechnol J. 2021 Jul 8;19:4003-4017. doi: 10.1016/j.csbj.2021.07.003. eCollection 2021.

Synthesis of 2-alkylthio-N-(quinazolin-2-yl)benzenesulfonamide derivatives: anticancer activity, QSAR studies, and metabolic stability.

Monatsh Chem. 2018;149(10):1885-1898. doi: 10.1007/s00706-018-2251-6. Epub 2018 Jul 13.

A Cluster-then-label Semi-supervised Learning Approach for Pathology Image Classification.

Sci Rep. 2018 May 8;8(1):7193. doi: 10.1038/s41598-018-24876-0.

Automatic discrimination between safe and unsafe swallowing using a reputation-based classifier.

Biomed Eng Online. 2011 Nov 15;10:100. doi: 10.1186/1475-925X-10-100.

Quantitative structure-activity relationship by CoMFA for cyclic urea and nonpeptide-cyclic cyanoguanidine derivatives on wild type and mutant HIV-1 protease.

J Mol Model. 2005 Mar;11(2):105-15. doi: 10.1007/s00894-004-0226-5. Epub 2005 Feb 16.
