IEEE Trans Cybern. 2018 Sep;48(9):2697-2711. doi: 10.1109/TCYB.2017.2748598. Epub 2017 Sep 18.
Short text streams such as search snippets and microblogs have become popular on the Web with the emergence of social media. Unlike traditional text streams, these data are characterized by short length, weak signal, high volume, high velocity, topic drift, etc. Short text stream classification is hence a challenging and significant task, yet it has received little attention from the research community. Therefore, a new feature extension approach is proposed for short text stream classification with the help of a large-scale semantic network obtained from a Web corpus. The approach is built on an incremental ensemble classification model for efficiency. First, additional semantic contexts based on the senses of terms in short texts are introduced from the open semantic network to compensate for data sparsity; all terms are disambiguated by their semantics to reduce the impact of noise. Second, a concept cluster-based topic drift detection method is proposed to track hidden topic drifts. Finally, extensive studies demonstrate that our approach detects topic drifts effectively compared with several well-known concept drift detection methods for data streams, and that it handles short text streams effectively while maintaining efficiency compared with several state-of-the-art short text classification approaches.
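The two ideas named in the abstract, feature extension through a semantic network and drift detection over concepts, can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the toy network `TOY_NETWORK`, the helper names, the use of cosine distance between per-chunk concept counts, and the threshold of 0.5 are all assumptions made for illustration, and the sense disambiguation step and the incremental ensemble classifier are omitted.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy semantic network (term -> concepts). The paper uses a
# large-scale network built from a Web corpus; this dict is only a stand-in.
TOY_NETWORK = {
    "apple": ["fruit", "company"],
    "iphone": ["company", "phone"],
    "banana": ["fruit"],
    "python": ["language", "snake"],
    "java": ["language"],
}


def expand(text):
    """Feature extension: append the concepts of known terms to the tokens.

    No sense disambiguation is done here, so every listed concept is added.
    """
    tokens = text.lower().split()
    concepts = [c for t in tokens for c in TOY_NETWORK.get(t, [])]
    return tokens + concepts


def concept_profile(chunk):
    """Count how often each concept occurs across the texts of one chunk."""
    counts = Counter()
    for text in chunk:
        for token in text.lower().split():
            counts.update(TOY_NETWORK.get(token, []))
    return counts


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two count vectors (1.0 if one is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def drift_detected(prev_chunk, next_chunk, threshold=0.5):
    """Flag a topic drift when consecutive chunks differ strongly in concepts.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    distance = cosine_distance(concept_profile(prev_chunk),
                               concept_profile(next_chunk))
    return distance > threshold


if __name__ == "__main__":
    print(expand("Apple iPhone"))
    print(drift_detected(["apple banana", "banana"], ["python java"]))
    print(drift_detected(["apple banana", "banana"], ["banana apple"]))
```

In this sketch, a chunk about fruit followed by a chunk about programming languages shares no concepts and is flagged as a drift, while two chunks that both mention fruit are not. Comparing concepts rather than raw tokens is what lets two short texts with no words in common still be recognized as being on the same topic.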