Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios.

Author information

Adutwum Lawrence A, de la Mata A Paulina, Bean Heather D, Hill Jane E, Harynuk James J

Affiliations

Department of Chemistry, University of Alberta, 11227 Saskatchewan Drive NW, Edmonton, Alberta, T6G 2G2, Canada.

School of Life Sciences, Arizona State University, 427 E Tyler Mall, Tempe, AZ, 85287, USA.

Publication information

Anal Bioanal Chem. 2017 Nov;409(28):6699-6708. doi: 10.1007/s00216-017-0628-8. Epub 2017 Sep 29.

DOI:10.1007/s00216-017-0628-8

PMID:28963623

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9677961/

Abstract

Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). The implementation of CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables passed to the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. The setting of these critical parameters has always relied on trial and error, which introduces subjectivity into the results obtained. The start and stop numbers are known to vary with each dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from the comparison of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets. The optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for the two datasets using start and stop numbers calculated with this approach. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.

Graphical abstract: Here, we describe how to determine the start and stop numbers for an automated feature selection routine, ensuring that you get the best model you can for your data with minimal effort.
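The abstract names two ingredients, Fisher ratios with a null distribution and the overlapping coefficient, without giving formulas. The sketch below illustrates both under stated assumptions and is not the authors' implementation. It assumes the Fisher ratio is the one-way ANOVA F statistic per variable, that the null distribution is built by permuting class labels, and that the overlapping coefficient is estimated from histograms on a shared grid. The function names, the number of permutations, and the bin count are hypothetical choices made for illustration. The empirical equations for the start and stop numbers and the constant d are not given in the abstract, so they are not reproduced here.

```python
import numpy as np


def fisher_ratios(X, y):
    """Per-variable Fisher ratio, taken here (an assumption) as the one-way
    ANOVA F statistic: between-class variance / pooled within-class variance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    between /= len(classes) - 1
    within /= X.shape[0] - len(classes)
    return between / within


def null_fisher_ratios(X, y, n_permutations=20, seed=0):
    """Null distribution of Fisher ratios from permuted class labels
    (hypothetical helper; the permutation count is an arbitrary choice)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    return np.concatenate(
        [fisher_ratios(X, rng.permutation(y)) for _ in range(n_permutations)]
    )


def overlapping_coefficient(a, b, bins=50):
    """Histogram estimate of the overlapping coefficient, the integral of
    min(f, g) for two densities f and g evaluated on a shared grid."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    f, _ = np.histogram(a, bins=edges, density=True)
    g, _ = np.histogram(b, bins=edges, density=True)
    return float(np.minimum(f, g).sum() * (edges[1] - edges[0]))


# Synthetic two-class data: 40 samples, 200 variables, 10 of them informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))
y = np.repeat([0, 1], 20)
X[y == 1, :10] += 3.0

observed = fisher_ratios(X, y)
null = null_fisher_ratios(X, y)
ovl = overlapping_coefficient(observed, null)
print(f"overlap between observed and null Fisher ratios: {ovl:.3f}")
```

An overlap near 1 would suggest that the observed Fisher ratios are hard to distinguish from those obtained with shuffled labels, while a smaller overlap suggests that some variables carry class information. How such a comparison maps onto start and stop numbers is defined by the paper's empirical equations, which this sketch does not attempt to reconstruct.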

Cited by

Preclinical modeling of metabolic syndrome to study the pleiotropic effects of novel antidiabetic therapy independent of obesity.

Sci Rep. 2024 Sep 5;14(1):20665. doi: 10.1038/s41598-024-71202-y.

Dietary benzoic acid and supplemental enzymes alter fiber-fermenting taxa and metabolites in the cecum of weaned pigs.

J Anim Sci. 2022 Nov 1;100(11). doi: 10.1093/jas/skac324.
