College of Cybersecurity, Sichuan University, Chengdu 610065, China.
College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China.
Sensors (Basel). 2020 Jul 17;20(14):3989. doi: 10.3390/s20143989.
Pornographic and gambling websites are becoming increasingly hard to eliminate because they rely on disguising, misleading, blocking, and bypassing techniques, which hinders the construction of a safe and healthy network environment. However, most traditional approaches detect such sites from a single aspect and therefore fail in more intricate and challenging situations. To alleviate this problem, this study proposed an automatic detection system for porn and gambling websites based on visual and textual content using a decision mechanism (PG-VTDM). The system can be deployed on an intelligent wireless router at home or at school to identify, block, and warn about unsuitable websites. First, Doc2Vec was employed to learn textual features that represent the textual content in the hypertext markup language (HTML) source code of the websites. In addition, the traditional bag-of-visual-words (BoVW) model was improved by introducing local spatial relationships between feature points to better represent the visual features of website screenshots. Then, a text classifier and an image classifier were trained on these two types of features. In the decision mechanism, a data fusion algorithm based on logistic regression (LR) was designed to obtain the final prediction by measuring the contribution of each of the two classification results to the final category. The effectiveness of the proposed approach was substantiated through comparison experiments on gambling and porn website datasets crawled from the Internet. The proposed approach outperformed single-feature approaches and several state-of-the-art approaches, with accuracy, precision, and F-measure all above 99%.
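The LR-based decision mechanism described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the inputs or the training procedure of the fusion step, so the following assumes that each classifier outputs a probability that a site is ill-suited and that a logistic regression over those two probabilities is fitted by plain batch gradient descent. The function names (`fuse`, `train_fusion`), the hyperparameters, and the toy samples are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fuse(p_text, p_image, weights):
    """Combine the two classifier probabilities into one fused probability."""
    w_text, w_image, bias = weights
    return sigmoid(w_text * p_text + w_image * p_image + bias)


def train_fusion(samples, learning_rate=0.5, epochs=2000):
    """Fit the fusion weights by batch gradient descent on the log loss.

    samples: list of (p_text, p_image, label) tuples with label in {0, 1}.
    Returns (w_text, w_image, bias); a larger weight means that classifier
    contributes more to the final decision.
    """
    w_text = w_image = bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        g_text = g_image = g_bias = 0.0
        for p_text, p_image, label in samples:
            error = fuse(p_text, p_image, (w_text, w_image, bias)) - label
            g_text += error * p_text
            g_image += error * p_image
            g_bias += error
        w_text -= learning_rate * g_text / n
        w_image -= learning_rate * g_image / n
        bias -= learning_rate * g_bias / n
    return w_text, w_image, bias


# Hypothetical validation outputs: (text probability, image probability, label).
SAMPLES = [
    (0.90, 0.80, 1), (0.80, 0.30, 1), (0.30, 0.90, 1), (0.95, 0.90, 1),
    (0.20, 0.10, 0), (0.40, 0.20, 0), (0.10, 0.40, 0), (0.05, 0.10, 0),
]

weights = train_fusion(SAMPLES)
print("learned weights (text, image, bias):", weights)
# A case where the two classifiers disagree; the fused score arbitrates.
print("fused probability:", fuse(0.80, 0.30, weights))
```

In this sketch the learned weights play the role of the "contribution" of each classifier: if one classifier is more reliable on the held-out data, its weight grows and its vote dominates when the text and image predictions disagree.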