Zhang Yihong, Shirakawa Masumi, Hara Takahiro
Graduate School of Information Science and Technology, Multimedia Data Engineering Lab, Osaka University, Osaka, Japan.
J Intell Inf Syst. 2023;60(1):73-95. doi: 10.1007/s10844-022-00730-8. Epub 2022 Jul 29.
Given the recent availability of large volumes of social media discussions, detecting temporally unusual phenomena, which can be called events, in such data is of great interest. Previous work on social media event detection either assumes a specific type of event or assumes certain behavior of the observed variables. In this paper, we propose a general method for event detection on social media that makes few assumptions. Our main assumption is that when an event occurs, the affected semantic aspects deviate from their usual behavior for a sustained period. We generalize the representation of time units based on word embeddings of social media text, and propose an algorithm to detect durative events in time series in a general sense. We also provide an incremental version of the algorithm for real-time detection. We test our approaches on synthetic data and two real-world tasks. On the synthetic dataset, we compare the performance of the retrospective and incremental versions of the algorithm. In the first real-world task, we use a novel setting to test whether our method and baseline methods can exhaustively catch all real-world news in the test period. The evaluation results show that when an event is highly unusual relative to the baseline social media discussion, our method captures it more effectively. In the second real-world task, we use the captured events to help improve the accuracy of stock market movement prediction. We show that our event-based approach has a clear advantage over other ways of adding social media information.
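The main assumption stated in the abstract (affected semantic aspects deviate from their usual behavior for a sustained period) can be illustrated with a minimal sketch. This is not the algorithm from the paper, which the abstract does not specify. It assumes that each time unit is already summarized as a vector (for example, the mean word embedding of the posts in that unit), that "usual behavior" is estimated from an initial baseline window, and that a deviation is a z-score above a threshold. The function name `detect_durative_events` and the parameters `baseline_len`, `z_threshold`, and `min_duration` are hypothetical.

```python
from statistics import mean, pstdev


def detect_durative_events(series, baseline_len, z_threshold=3.0, min_duration=3):
    """Illustrative sketch only, not the paper's method.

    series: list of equal-length vectors, one per time unit (assumed to be,
            e.g., the mean word embedding of that unit's posts).
    Returns a list of (dimension, start, end) tuples, end exclusive, for every
    run of at least `min_duration` consecutive time units in which that
    dimension deviates from its baseline mean by more than `z_threshold`
    baseline standard deviations.
    """
    dims = len(series[0])
    baseline = series[:baseline_len]

    # Per-dimension "usual behavior", estimated from the baseline window.
    mu = [mean(v[d] for v in baseline) for d in range(dims)]
    # Fall back to a tiny value when a dimension is constant in the baseline.
    sigma = [pstdev([v[d] for v in baseline]) or 1e-9 for d in range(dims)]

    events = []
    for d in range(dims):
        run_start = None
        # Iterate one step past the end so that a trailing run is closed.
        for t in range(baseline_len, len(series) + 1):
            unusual = (
                t < len(series)
                and abs(series[t][d] - mu[d]) / sigma[d] > z_threshold
            )
            if unusual and run_start is None:
                run_start = t
            elif not unusual and run_start is not None:
                # Keep only sustained deviations; short spikes are dropped.
                if t - run_start >= min_duration:
                    events.append((d, run_start, t))
                run_start = None
    return events
```

The `min_duration` filter is what separates a durative event from a one-off spike in this sketch. An incremental variant in the spirit of the abstract would process one time unit at a time and keep the open run as state, but the details of the authors' incremental version are not given here.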