Chen Zhong, Fang Zhide, Fan Wei, Edwards Andrea, Zhang Kun
Department of Computer Science, Xavier University of Louisiana.
Department of Biostatistics, School of Public Health, LSU Health Sciences Center.
SIAM Rev Soc Ind Appl Math. 2017 Apr;2017:759-767. doi: 10.1137/1.9781611974973.85.
Sparse online learning and cost-sensitive learning are two important areas of machine learning and data mining research. Each has been well studied with many interesting algorithms developed. However, very limited published work addresses the joint study of these two fields. In this paper, to tackle the high-dimensional data streams with skewed distributions, we introduce a framework of cost-sensitive sparse online learning. Our proposed framework is a substantial extension of the influential Truncated Gradient (TG) method by formulating a new convex optimization problem, where the two mutual restraint factors, misclassification cost and sparsity, can be simultaneously and favorably balanced. We theoretically analyze the regret and cost bounds of the proposed algorithm, and pinpoint its theoretical merit compared to the existing related approaches. Large-scale empirical comparisons to five baseline methods on eight real-world streaming datasets demonstrate the encouraging performance of the developed method. Algorithm implementation and datasets are available upon request.
稀疏在线学习和成本敏感学习是机器学习和数据挖掘研究的两个重要领域。每个领域都得到了充分研究,开发出了许多有趣的算法。然而,已发表的关于这两个领域联合研究的工作非常有限。在本文中,为了处理具有偏态分布的高维数据流,我们引入了一个成本敏感稀疏在线学习框架。我们提出的框架是对有影响力的截断梯度(TG)方法的实质性扩展,通过制定一个新的凸优化问题,其中误分类成本和稀疏性这两个相互约束的因素可以同时且良好地得到平衡。我们从理论上分析了所提算法的遗憾值和成本界,并指出其与现有相关方法相比的理论优点。在八个真实世界的流数据集上与五种基线方法进行的大规模实证比较证明了所开发方法令人鼓舞的性能。算法实现和数据集可应要求提供。