Kapusta Jozef, Drlik Martin, Munk Michal
Department of Informatics, Constantine the Philosopher University in Nitra, Nitra, Slovakia.
Science and Research Centre, University of Pardubice, Pardubice, Czech Republic.
PeerJ Comput Sci. 2021 Jul 19;7:e624. doi: 10.7717/peerj-cs.624. eCollection 2021.
Research into techniques for effective fake news detection has become both necessary and attractive. These techniques draw on many research disciplines, including morphological analysis. Several researchers have stated that simple content-related n-grams and POS tagging have been proven insufficient for fake news classification. However, over the last decade they have not published empirical results that would confirm these statements experimentally. Given this contradiction, the main aim of the paper is to experimentally evaluate the potential of the combined use of n-grams and POS tags for correctly classifying fake and true news. A dataset of published fake and real news about the current Covid-19 pandemic was pre-processed using morphological analysis, and n-grams of POS tags were prepared and analysed further. Three techniques based on POS tags were proposed and applied to different groups of n-grams in the pre-processing phase of fake news detection. First, the n-gram size was examined. Next, the most suitable depth of the decision trees for sufficient generalisation was determined. Finally, the performance measures of models based on the proposed techniques were compared with the standardised reference TF-IDF technique. The performance measures considered were accuracy, precision, recall and F1-score, evaluated with 10-fold cross-validation. In parallel, the question of whether the TF-IDF technique can be improved using POS tags was examined in detail. The results showed that the newly proposed techniques are comparable with the traditional TF-IDF technique, and that morphological analysis can improve the baseline TF-IDF technique: precision for fake news and recall for real news were statistically significantly improved.
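The idea of building n-grams from POS tags and weighting them with TF-IDF can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce any of the three proposed techniques, which the abstract does not specify. The tag sequences, the function names `pos_ngrams` and `tfidf`, and the plain `log(N / df)` form of the inverse document frequency are all assumptions made for illustration. The sketch assumes the texts have already been POS-tagged and leaves out the decision tree classifier and the cross-validation.

```python
from collections import Counter
import math


def pos_ngrams(tags, n):
    """Return contiguous n-grams of a POS tag sequence, joined with '_'."""
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def tfidf(docs_ngrams):
    """Return one {n-gram: weight} dict per document.

    Assumed weighting: tf = count / number of n-grams in the document,
    idf = log(number of documents / document frequency).
    """
    n_docs = len(docs_ngrams)
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for grams in docs_ngrams:
        df.update(set(grams))
    weights = []
    for grams in docs_ngrams:
        counts = Counter(grams)
        total = len(grams)
        weights.append(
            {g: (c / total) * math.log(n_docs / df[g]) for g, c in counts.items()}
        )
    return weights


# Hypothetical, already POS-tagged texts (illustrative tags, not the paper's data).
tagged_docs = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["NOUN", "VERB", "ADJ", "NOUN"],
    ["DET", "NOUN", "VERB", "ADV"],
]

bigrams = [pos_ngrams(tags, 2) for tags in tagged_docs]
weights = tfidf(bigrams)
```

In this toy input the bigram `NOUN_VERB` occurs in every document, so its weight is zero, while rarer tag patterns receive positive weights. Per-document weight vectors of this kind could then serve as input features for a classifier such as a decision tree.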