Adutwum Lawrence A, de la Mata A Paulina, Bean Heather D, Hill Jane E, Harynuk James J
Department of Chemistry, University of Alberta, 11227 Saskatchewan Drive NW, Edmonton, Alberta, T6G 2G2, Canada.
School of Life Sciences, Arizona State University, 427 E Tyler Mall, Tempe, AZ, 85287, USA.
Anal Bioanal Chem. 2017 Nov;409(28):6699-6708. doi: 10.1007/s00216-017-0628-8. Epub 2017 Sep 29.
Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables passed to the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results obtained. The start and stop numbers are known to vary with each dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from comparisons of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets. The optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for both datasets using start and stop numbers calculated with this approach. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.
Graphical abstract: Here, we describe how to determine the start and stop numbers for an automated feature selection routine, ensuring that you get the best model you can for your data with minimal effort.
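The overlapping coefficient mentioned above is commonly defined as the area shared by two probability density functions, OVL = ∫ min(f(x), g(x)) dx, which ranges from 0 (no overlap) to 1 (identical densities). The sketch below illustrates only that general quantity. It is not the authors' empirical equations for the start and stop numbers, which are not given in this abstract, and the function name `overlapping_coefficient`, the histogram-based estimator, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def overlapping_coefficient(a, b, bins=200):
    """Estimate OVL = integral of min(f, g) from two samples.

    Hypothetical helper: densities are approximated with histograms
    computed on a shared set of bin edges (an assumed estimator).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    if hi == lo:
        # Every value is identical, so the two distributions coincide.
        return 1.0
    edges = np.linspace(lo, hi, bins + 1)
    f, _ = np.histogram(a, bins=edges, density=True)
    g, _ = np.histogram(b, bins=edges, density=True)
    width = edges[1] - edges[0]
    # Riemann sum of the pointwise minimum of the two density estimates.
    return float(np.sum(np.minimum(f, g)) * width)


# Simulated values of one variable in two classes (illustrative only).
rng = np.random.default_rng(0)
class_a = rng.normal(loc=0.0, scale=1.0, size=5000)
class_b = rng.normal(loc=1.0, scale=1.0, size=5000)
print(overlapping_coefficient(class_a, class_b))
```

A value near 0 suggests the variable separates the two classes well, while a value near 1 suggests it carries little discriminating information. The histogram estimate depends on the bin count and sample size, so it should be read as approximate.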