Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios.

Author information

Adutwum Lawrence A, de la Mata A Paulina, Bean Heather D, Hill Jane E, Harynuk James J

Affiliations

Department of Chemistry, University of Alberta, 11227 Saskatchewan Drive NW, Edmonton, Alberta, T6G 2G2, Canada.

School of Life Sciences, Arizona State University, 427 E Tyler Mall, Tempe, AZ, 85287, USA.

Publication information

Anal Bioanal Chem. 2017 Nov;409(28):6699-6708. doi: 10.1007/s00216-017-0628-8. Epub 2017 Sep 29.

Abstract

Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) followed by sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of top-ranked variables passed to SBE, while the stop number is the point at which the search for additional features during the SFS stage is halted. These critical parameters have always been set by trial and error, which introduces subjectivity into the results. The start and stop numbers are known to vary with each dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from the comparison of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets. The optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for the two datasets when the start and stop numbers were calculated with this approach. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.

Graphical abstract: Here, we describe how to determine the start and stop numbers for an automated feature selection routine, ensuring that you get the best model you can for your data with minimal effort.
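The abstract does not reproduce the empirical equations, so the following is only a minimal sketch of the two ingredients it names: per-variable Fisher ratios and an overlapping coefficient between two probability density functions. It assumes that the two densities being compared are the observed Fisher ratios and a null distribution obtained by shuffling class labels (suggested by the title, not confirmed here). The function names, the histogram-based density estimate, the log scaling, and the synthetic data are all illustrative assumptions, and the mapping from the overlap to start and stop numbers (including the constant d) is deliberately left out.

```python
import numpy as np


def fisher_ratios(X, y):
    """Per-variable Fisher ratio: between-class over within-class variance.

    Assumes no variable is constant within every class (nonzero denominator).
    """
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += Xc.shape[0] * (Xc.mean(axis=0) - grand_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (between / (k - 1)) / (within / (n - k))


def null_fisher_ratios(X, y, rng, n_perm=20):
    """Empirical null (assumption): Fisher ratios after shuffling class labels."""
    return np.concatenate(
        [fisher_ratios(X, rng.permutation(y)) for _ in range(n_perm)]
    )


def overlap_coefficient(a, b, bins=50):
    """Histogram estimate of OVL = integral of min(f(x), g(x)) dx."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)
    f, _ = np.histogram(a, bins=edges, density=True)
    g, _ = np.histogram(b, bins=edges, density=True)
    return float(np.sum(np.minimum(f, g) * np.diff(edges)))


# Synthetic two-class data (hypothetical): 200 variables, 10 of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
y = np.repeat([0, 1], 20)
X[y == 1, :10] += 3.0

observed = fisher_ratios(X, y)
null = null_fisher_ratios(X, y, rng)

# Fisher ratios are heavy-tailed, so the densities are compared on a log scale.
ovl = overlap_coefficient(np.log10(observed), np.log10(null))
ranking = np.argsort(observed)[::-1]  # variables ranked from most to least discriminating
print(f"overlap coefficient: {ovl:.3f}")
```

An overlap near 1 would suggest that few variables separate the classes better than chance, while a smaller overlap suggests more discriminating variables; how the paper converts such a comparison into start and stop numbers is given by its empirical equations, which are not reproduced in this record.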
