Wakamiya Shoko, Kawai Yukiko, Aramaki Eiji
Nara Institute of Science and Technology, Ikoma, Japan.
Kyoto Sangyo University, Kyoto, Japan.
JMIR Public Health Surveill. 2018 Sep 25;4(3):e65. doi: 10.2196/publichealth.8627.
The recent rise in the popularity and scale of social networking services (SNSs) has increased the need for SNS-based information extraction systems. A popular application of SNS data is health surveillance, which aims to predict epidemic outbreaks by detecting mentions of disease in text messages posted on SNS platforms. Such applications share the same logic: they treat SNS users as social sensors. These social sensor-based approaches also share a common problem: SNS-based surveillance is much more reliable when a sufficient number of users are active, whereas small or inactive populations produce inconsistent results.
This study proposes a novel approach that estimates the trend in patient numbers using indirect information within posts, covering both urban and rural areas.
We present a TRAP model that incorporates both direct and indirect information. A collection of tweets spanning 3 years (7 million influenza-related tweets in Japanese) was used to evaluate the model. Both direct information and indirect information that mentions other places were used. Because indirect information is less reliable (too noisy or too old) than direct information, it was not used directly and was instead treated as inhibiting direct information. For example, when indirect information appeared often, it was taken to signify that everyone already had a known disease, which leads to a small amount of direct information.
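The inhibition idea can be made concrete with a toy calculation. The sketch below is not the TRAP model itself, whose equations are not given here; it assumes a purely hypothetical functional form in which the expected number of direct posts is the patient count times a posting rate, divided by a damping term that grows with the amount of indirect information. The names `expected_direct_posts`, `estimate_patients`, `rate`, and `alpha` are illustrative assumptions.

```python
def expected_direct_posts(patients, indirect_posts, rate=0.1, alpha=0.5):
    """Toy forward model (hypothetical, not the published TRAP equations).

    Direct posts grow with the number of patients but are damped
    (inhibited) as indirect posts become more frequent.
    """
    if patients < 0 or indirect_posts < 0:
        raise ValueError("counts must be non-negative")
    return patients * rate / (1.0 + alpha * indirect_posts)


def estimate_patients(direct_posts, indirect_posts, rate=0.1, alpha=0.5):
    """Invert the toy forward model to recover a patient estimate.

    A given number of direct posts implies more patients when indirect
    posts are frequent, because the direct signal is assumed to be damped.
    """
    if direct_posts < 0 or indirect_posts < 0:
        raise ValueError("counts must be non-negative")
    return direct_posts * (1.0 + alpha * indirect_posts) / rate


if __name__ == "__main__":
    # Same number of patients, increasing indirect information:
    for indirect in (0, 2, 8):
        direct = expected_direct_posts(100, indirect)
        recovered = estimate_patients(direct, indirect)
        print(indirect, round(direct, 2), round(recovered, 2))
```

Under these assumptions, the same number of direct posts is read as a larger patient count when indirect information is frequent. The damping form and the parameter values are placeholders chosen only to illustrate the direction of the effect.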
The estimation performance of our approach was evaluated using the correlation coefficient between the number of influenza cases, used as the gold standard, and the values estimated by the proposed models. The baseline model (BASELINE+NLP) achieved a correlation of .36, and the proposed model (TRAP+NLP) improved on this (.70, an increase of .34 points).
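As a minimal sketch of this kind of evaluation, the code below computes a correlation coefficient between gold-standard case counts and model estimates. It assumes Pearson's correlation, which the text above does not specify, and the numbers in the usage example are made-up placeholders rather than data from the study.

```python
import math


def pearson_r(gold, estimated):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(gold) != len(estimated):
        raise ValueError("sequences must have the same length")
    n = len(gold)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_g = sum(gold) / n
    mean_e = sum(estimated) / n
    # Sums of centered cross-products and centered squares
    cov = sum((g - mean_g) * (e - mean_e) for g, e in zip(gold, estimated))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    if var_g == 0 or var_e == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_g * var_e)


if __name__ == "__main__":
    # Placeholder weekly values, for illustration only
    reported_cases = [12, 30, 55, 80, 60, 25]
    model_estimates = [10, 28, 60, 70, 65, 30]
    print(round(pearson_r(reported_cases, model_estimates), 3))
```

A value near 1 means the estimates rise and fall with the reported cases, and the reported gain of .34 is the difference between two such coefficients.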
The proposed approach, in which indirect information inhibits direct information, improved estimation performance not only in rural cities but also in urban cities, demonstrating the effectiveness of the proposed method, which combines a TRAP model with natural language processing (NLP) classification.