Department of Computer Information Technology and Graphics, Purdue University Northwest, Hammond, IN, USA.
Department of Medicine, Vanderbilt University, Nashville, TN, USA.
BMC Bioinformatics. 2018 Jun 13;19(Suppl 8):210. doi: 10.1186/s12859-018-2198-y.
As Twitter has become an active data source for health surveillance research, efficient and effective methods are needed to identify tweets related to personal health experience. Conventional classification algorithms rely on features engineered by human domain experts, a challenging task that demands considerable expertise. The resulting features may not be optimal for the classification problem, and conventional classifiers can struggle to correctly predict personal experience tweets (PETs) because personal experience can be expressed and described in tweets in so many ways. In this study, we developed a method that combines word embedding with a long short-term memory (LSTM) model and requires no feature engineering. Through word embedding, tweet texts were represented as sequences of dense vectors, which were then fed to the LSTM neural network.
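To make the data flow concrete, the following is a minimal sketch of how a tweet might pass through an embedding lookup and a single LSTM layer to yield a PET probability. It is not the implementation used in the study: the class name `TinyLSTMClassifier`, the vocabulary, the dimensions, and the randomly initialized (untrained) weights are all assumptions chosen for illustration, and a real system would learn the embeddings and weights from labeled tweets.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_matrix(rows, cols, rng):
    # Small random weights; a trained model would learn these instead.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyLSTMClassifier:
    """Hypothetical, untrained embedding + LSTM classifier (illustration only)."""

    def __init__(self, vocab, embed_dim=4, hidden_dim=3, seed=0):
        rng = random.Random(seed)
        self.index = {word: i for i, word in enumerate(vocab)}
        self.hidden_dim = hidden_dim
        # One dense vector per vocabulary word, plus a final row for unknown words.
        self.embeddings = make_matrix(len(vocab) + 1, embed_dim, rng)
        # One weight matrix and bias per gate, applied to [x_t ; h_{t-1}].
        self.gates = {
            name: (make_matrix(hidden_dim, embed_dim + hidden_dim, rng),
                   [0.0] * hidden_dim)
            for name in ("input", "forget", "output", "candidate")
        }
        self.out_w = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.out_b = 0.0

    def embed(self, tokens):
        # Map each token to its dense vector; unseen tokens share the last row.
        unknown = len(self.index)
        return [self.embeddings[self.index.get(t, unknown)] for t in tokens]

    def step(self, x, h, c):
        # Standard LSTM cell update for one time step.
        z = x + h  # list concatenation of the input and previous hidden state
        pre = {
            name: [a + b for a, b in zip(matvec(w, z), bias)]
            for name, (w, bias) in self.gates.items()
        }
        i = [sigmoid(v) for v in pre["input"]]
        f = [sigmoid(v) for v in pre["forget"]]
        o = [sigmoid(v) for v in pre["output"]]
        g = [math.tanh(v) for v in pre["candidate"]]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c

    def predict_proba(self, tweet):
        # Whitespace tokenization is a simplification for this sketch.
        tokens = tweet.lower().split()
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in self.embed(tokens):
            h, c = self.step(x, h, c)
        # The final hidden state feeds a logistic output unit.
        logit = sum(w * v for w, v in zip(self.out_w, h)) + self.out_b
        return sigmoid(logit)


vocab = ["i", "took", "ibuprofen", "and", "my", "headache", "is", "gone"]
model = TinyLSTMClassifier(vocab)
probability = model.predict_proba("I took ibuprofen and my headache is gone")
```

Because the weights here are random, `probability` carries no meaning beyond showing that the tweet is consumed token by token and reduced to a single score, with no hand-engineered features involved.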
Statistical analyses of 10-fold cross-validation results for our method and conventional methods show significant differences (p < 0.01) in accuracy, precision, recall, F1-score, and ROC/AUC, demonstrating that our approach outperforms the conventional methods in identifying PETs.
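For readers less familiar with these measures, the sketch below shows how accuracy, precision, recall, and F1-score could be computed for one cross-validation fold of a binary PET classifier. The function name `classification_metrics` and the toy labels are assumptions made up for illustration and are not data from the study; ROC/AUC and the significance test are omitted because they require predicted scores and per-fold results that are not given here.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics, treating label 1 (PET) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for a single hypothetical fold: 1 = PET, 0 = non-PET.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In a 10-fold setting, a function like this would be applied to each held-out fold in turn, and the ten resulting values per measure would be compared between methods.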
We presented an efficient and effective method of identifying health-related personal experience tweets by combining word embedding and an LSTM neural network. It is conceivable that our method can help accelerate and scale up the analysis of social media text for health surveillance purposes, because it does not require the laborious and costly process of feature engineering.