Li Ren-De, Ma Hao-Tian, Wang Zi-Yi, Guo Qiang, Liu Jian-Guo
Library and Research Center of Computer Systems Science, University of Shanghai for Science and Technology, Shanghai 200093, PR China.
School of Accountancy and Shanghai Key Laboratory of Financial Information Technology, Shanghai University of Finance and Economics, Shanghai 200433, PR China.
J Saf Sci Resil. 2020 Sep;1(1):36-43. doi: 10.1016/j.jnlssr.2020.06.005. Epub 2020 Jun 30.
Perceiving which entity an ambiguous user comment refers to is a critical problem in target identification across large volumes of public opinion. In this paper, a Two-Step-Matching method is proposed to identify the precise target entity among the multiple entities mentioned in a comment. First, potential entities are extracted from public comments with a BiLSTM-CRF model, and characteristic words are extracted with a TF-IDF model. Second, the first matching step compares the potential entities against an official business directory using the Jaro-Winkler distance algorithm. Then, to single out the precise entity, an industry-characteristic dictionary is introduced in the second matching step: the precise entity is identified by the number of characteristic words that match the industry-characteristic dictionary. In addition, an associated rate (a global indicator) and an accuracy rate (a sample indicator) are defined to evaluate matching accuracy. Results on three data sets of public opinions about major public health events show that the highest associated rate and accuracy rate reach 0.93 and 0.95, respectively, which are average improvements of 32% and 30% over using the first matching step alone. The framework provides a method for finding the true target entity that a public opinion is intended to address.
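The two matching steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 0.85 similarity threshold, the helper names (`first_match`, `second_match`), and the directory, industry labels, and word lists are all hypothetical, and the entity extraction (BiLSTM-CRF) and characteristic-word extraction (TF-IDF) are assumed to have already run. Jaro-Winkler is written out in its standard form with a prefix scale of 0.1 and a prefix cap of 4.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def first_match(candidate, directory, threshold=0.85):
    """Step 1: keep directory names similar enough to the extracted entity.
    The threshold value is an assumption, not taken from the paper."""
    return [name for name in directory if jaro_winkler(candidate, name) >= threshold]


def second_match(names, feature_words, industry_of, industry_dict):
    """Step 2: pick the name whose industry dictionary shares the most
    characteristic words with the comment."""
    def score(name):
        return len(set(feature_words) & industry_dict.get(industry_of[name], set()))
    return max(names, key=score) if names else None


# Hypothetical data for illustration only.
directory = ["Sunrise Pharmacy", "Sunrise Pharma Logistics", "Blue River Bank"]
industry_of = {
    "Sunrise Pharmacy": "retail pharmacy",
    "Sunrise Pharma Logistics": "logistics",
    "Blue River Bank": "banking",
}
industry_dict = {
    "retail pharmacy": {"medicine", "prescription", "pharmacist"},
    "logistics": {"delivery", "warehouse", "freight"},
    "banking": {"loan", "deposit", "account"},
}
feature_words = ["medicine", "prescription", "queue"]  # e.g. top TF-IDF terms

shortlist = first_match("Sunrise Pharm", directory)
target = second_match(shortlist, feature_words, industry_of, industry_dict)
```

In this toy case the first step leaves two similarly named businesses on the shortlist, and the second step resolves the ambiguity in favour of the pharmacy because two of the comment's characteristic words fall in that industry's dictionary.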