Department of Computing, Bournemouth University, Poole, UK; Institute Mines Telecom Lille Douai, Douai, France.
Institute Mines Telecom Lille Douai, Douai, France.
Neural Netw. 2018 Feb;98:1-15. doi: 10.1016/j.neunet.2017.10.004. Epub 2017 Oct 27.
The classification of data streams is an interesting but challenging problem. A data stream may grow without bound, making it impractical to store before processing and classification. Because of its dynamic nature, the underlying distribution of the stream may change over time, resulting in so-called concept drift, or classes may emerge and fade, which is known as concept evolution. In addition, acquiring labels for data samples in a stream is expensive, if not infeasible. In this paper, we propose a novel stream-based active learning algorithm (SAL) that copes with both concept drift and concept evolution by adapting the classification model to the dynamic changes in the stream. SAL is the first active learning (AL) algorithm in the literature to explicitly take these concepts into account. Moreover, SAL queries the labels of only those samples that are likely to reduce the expected future error. Querying is done while tackling the problem of sampling bias, so that samples that induce the change (i.e., drifting samples or samples coming from new classes) are queried. To implement SAL efficiently, the paper proposes the application of non-parametric Bayesian models, which make it possible to cope with the lack of prior knowledge about the data stream. In particular, Dirichlet mixture models and the stick-breaking process are adopted and adapted to meet the requirements of online learning. The empirical results obtained on real-world benchmarks demonstrate the superiority of SAL over state-of-the-art methods in terms of classification performance, measured by average accuracy and average class accuracy.
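For readers unfamiliar with the stick-breaking process named in the abstract, the following is a minimal sketch of the standard truncated construction of mixture weights. It is not the paper's online adaptation: the function name `stick_breaking_weights`, the concentration value `alpha=2.0`, and the truncation at 10 components are assumptions chosen only for illustration.

```python
import random


def stick_breaking_weights(alpha, num_components, rng):
    """Draw truncated stick-breaking weights (illustrative, not the paper's code).

    Each step breaks off a Beta(1, alpha) fraction of the stick that is left.
    Returns the list of weights and the unassigned remainder of the stick.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any component
    for _ in range(num_components):
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


# Hypothetical settings: alpha and the truncation level are assumptions.
rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, num_components=10, rng=rng)
```

The weights and the leftover mass always sum to one, so the number of effective mixture components does not need to be fixed in advance; a smaller `alpha` concentrates mass on fewer components. This is the property that makes such priors attractive when little is known about the stream beforehand.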