
Beta Distribution-Based Cross-Entropy for Feature Selection.

Author Information

Dai Weixing, Guo Dianjing

Affiliations

School of Life Science and State Key Laboratory of Agrobiotechnology, G94, Science Center South Block, The Chinese University of Hong Kong, Shatin 999077, Hong Kong, China.

Publication Information

Entropy (Basel). 2019 Aug 7;21(8):769. doi: 10.3390/e21080769.

Abstract

Analysis of high-dimensional data is a challenge in machine learning and data mining. Feature selection plays an important role in dealing with high-dimensional data, both for improving predictive accuracy and for better interpretation of the data. Frequently used evaluation functions for feature selection include resampling methods such as cross-validation, which show an advantage in predictive accuracy. However, these conventional methods are not only computationally expensive but also tend to be over-optimistic. We propose a novel cross-entropy based on the beta distribution for feature selection. In beta distribution-based cross-entropy (BetaDCE), the probability density is estimated by the beta distribution and the cross-entropy is computed from its expected value, so the generalization ability can be estimated more precisely than with conventional methods, where the probability density is learned from the data. Analysis of the generalization ability of BetaDCE revealed a trade-off between bias and variance. The robustness of BetaDCE was demonstrated by experiments on three types of data. On the exclusive-or-like (XOR-like) dataset, the false discovery rate of BetaDCE was significantly smaller than that of other methods. For the leukemia dataset, the area under the curve (AUC) of BetaDCE on the test set was 0.93 with only four selected features, indicating that BetaDCE not only detected the irrelevant and redundant features precisely, but also predicted the class labels more accurately with fewer features than the original method, whose AUC was 0.83 with 50 features. On the metabonomic dataset, the overall AUC of prediction with features selected by BetaDCE was significantly larger than that of the originally reported method. Therefore, BetaDCE can be used as a general and efficient framework for feature selection.
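The core idea described in the abstract — scoring a feature by a cross-entropy whose class probabilities come from the expected value of a beta posterior rather than from raw empirical frequencies — can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the function name `beta_cross_entropy` and the prior parameters `alpha`/`beta` are assumptions, and the sketch handles a single discrete feature with binary labels.

```python
import numpy as np

def beta_cross_entropy(x, y, alpha=1.0, beta=1.0):
    """Illustrative sketch of a beta distribution-based cross-entropy score.

    For each distinct value of the discrete feature x, the class probability
    is the expected value of a Beta(n1 + alpha, n0 + beta) posterior over the
    class counts in that cell, rather than the raw frequency n1 / (n1 + n0).
    Lower scores indicate features that separate the binary labels y better.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    score = 0.0
    for v in np.unique(x):
        mask = x == v
        n1 = int(y[mask].sum())       # positive labels in this cell
        n0 = int(mask.sum()) - n1     # negative labels in this cell
        # Expected value of the Beta(n1 + alpha, n0 + beta) posterior;
        # the prior keeps p strictly inside (0, 1), so log() is safe.
        p = (n1 + alpha) / (n1 + n0 + alpha + beta)
        # Cross-entropy contribution of this cell.
        score += -(n1 * np.log(p) + n0 * np.log(1.0 - p))
    return score / len(x)
```

Under this sketch, a perfectly informative feature (e.g. `x = [0,0,0,1,1,1]` with `y = [0,0,0,1,1,1]`) receives a lower score than an uninformative one (e.g. `x = [0,1,0,1,0,1]` with the same labels), so ranking features by ascending score would select the informative ones first. The beta prior acts as the smoothing that, per the abstract, makes the estimate less over-optimistic than frequencies learned directly from the data.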


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4010/7515297/9702b892638a/entropy-21-00769-g001.jpg
