Wang Gaizhen, Guan Guoyu
School of Mathematics and Statistics, Northeast Normal University, Changchun 130000, China.
Key Laboratory for Applied Statistics of the MOE, School of Economics and Management, Northeast Normal University, Changchun 130000, China.
Entropy (Basel). 2020 Mar 14;22(3):335. doi: 10.3390/e22030335.
In this study, we propose a novel model-free feature screening method, called weighted mean squared deviation (WMSD), for ultrahigh-dimensional binary features in binary classification. Compared with the Chi-square statistic and mutual information, WMSD gives binary features whose probabilities are near 0.5 a better chance of being selected. The asymptotic properties of the proposed method are investigated theoretically under the assumption log p = o(n), where p is the number of features and n is the sample size. In practice, the number of features to retain is chosen by a Pearson correlation coefficient method that exploits the properties of the power-law distribution. Lastly, an empirical study of Chinese text classification illustrates that the proposed method performs well when the number of selected features is relatively small.
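The abstract does not state the WMSD formula, so the following is only a sketch of marginal screening for binary features under a stated assumption: the score `wmsd_assumed` is taken to be the class-probability-weighted squared deviation between the class-conditional frequency of a feature and its overall frequency. That definition, along with the function names `screening_scores` and `top_d`, is hypothetical and is not taken from the paper. The Pearson Chi-square statistic is computed alongside it for comparison. For a binary feature, Chi-square equals n times this deviation divided by θ(1 − θ), where θ is the overall feature frequency, which suggests why an unnormalized deviation would favor features with frequencies near 0.5.

```python
import numpy as np


def screening_scores(X, y):
    """Marginal screening scores for binary features and a binary label.

    X: (n, p) array of 0/1 features; y: (n,) array of 0/1 labels.
    Assumes both classes are present in y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape

    theta = X.mean(axis=0)  # overall frequency of X_j = 1
    msd = np.zeros(p)
    for k in (0, 1):
        mask = y == k
        p_k = mask.mean()                # class proportion
        theta_k = X[mask].mean(axis=0)   # frequency of X_j = 1 within class k
        msd += p_k * (theta_k - theta) ** 2

    # Pearson Chi-square for the 2 x 2 table of (X_j, y):
    # n * msd / (theta * (1 - theta)); set to 0 for constant features.
    denom = theta * (1.0 - theta)
    safe = np.where(denom > 0, denom, 1.0)
    chi2 = np.where(denom > 0, n * msd / safe, 0.0)

    # "wmsd_assumed" is an ASSUMED form of the score, not the paper's formula.
    return {"wmsd_assumed": msd, "chi2": chi2}


def top_d(scores, d):
    """Indices of the d features with the largest scores."""
    return np.argsort(-np.asarray(scores), kind="stable")[:d]


if __name__ == "__main__":
    X = [[1, 0], [1, 1], [0, 0], [0, 1]]
    y = [1, 1, 0, 0]
    s = screening_scores(X, y)
    print(s["wmsd_assumed"], s["chi2"], top_d(s["wmsd_assumed"], 1))
```

In the toy data, the first feature matches the label exactly and the second is independent of it, so both scores rank the first feature on top. The two scores differ only in how they treat features whose overall frequency is far from 0.5.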