Hu Weiming, Wu Ou, Chen Zhouyao, Fu Zhouyu, Maybank Steve
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P.R. China.
IEEE Trans Pattern Anal Mach Intell. 2007 Jun;29(6):1019-34. doi: 10.1109/TPAMI.2007.1133.
With the rapid development of the World Wide Web, people benefit more and more from the sharing of information. However, Web pages with obscene, harmful, or illegal content are also easily accessed, so it is important to recognize such unsuitable, offensive, or pornographic pages. This paper describes a novel framework for recognizing pornographic Web pages. A C4.5 decision tree divides Web pages, according to their content representations, into continuous text pages, discrete text pages, and image pages. These three categories are handled, respectively, by a continuous text classifier, a discrete text classifier, and an algorithm that fuses the results of the image classifier and the discrete text classifier. The continuous text classifier uses statistical and semantic features to recognize pornographic texts. The discrete text classifier uses the naive Bayes rule to calculate the probability that a discrete text is pornographic. The image classifier extracts contour-based features of objects to recognize pornographic images. The text and image fusion algorithm uses Bayes theory to combine the recognition results from images and texts. Experimental results demonstrate that the continuous text classifier outperforms the traditional keyword-statistics-based classifier, the contour-based image classifier outperforms the traditional skin-region-based image classifier, the fusion algorithm outperforms either of the individual classifiers, and the framework can be adapted to different categories of Web pages.
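The abstract names two Bayesian steps without giving formulas: a naive Bayes rule that scores a discrete text, and a Bayes-based fusion of the text and image results. The following is a minimal sketch of what such steps could look like, not the authors' method. It assumes a multinomial naive Bayes model with Laplace smoothing, and a fusion rule that treats the text and image evidence as conditionally independent given the class. All function names, the smoothing constant `alpha`, the class labels, and the shared `prior` are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model (assumed variant) with Laplace smoothing.

    docs is a list of token lists; labels is a parallel list of class names.
    """
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d}
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)

    model = {"classes": classes, "vocab": vocab, "log_prior": {}, "log_like": {}}
    for c in classes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_prior"][c] = math.log(doc_counts[c] / len(docs))
        model["log_like"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def posterior(model, tokens, target):
    """Return P(target | tokens); tokens outside the vocabulary are ignored."""
    scores = {}
    for c in model["classes"]:
        scores[c] = model["log_prior"][c] + sum(
            model["log_like"][c][w] for w in tokens if w in model["vocab"]
        )
    # Normalise in log space to avoid underflow on long texts.
    top = max(scores.values())
    norm = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[target] - top) / norm


def fuse(p_text, p_image, prior=0.5):
    """Combine two binary-class posteriors under assumed conditional independence.

    Uses P(c | text, image) proportional to P(c | text) * P(c | image) / P(c),
    and assumes both classifiers were built with the same class prior.
    """
    pos = p_text * p_image / prior
    neg = (1.0 - p_text) * (1.0 - p_image) / (1.0 - prior)
    if pos + neg == 0.0:
        # The two sources flatly contradict each other; fall back to the prior.
        return prior
    return pos / (pos + neg)


if __name__ == "__main__":
    # Toy tokens and labels, purely illustrative.
    docs = [["a", "b"], ["a", "a"], ["c", "d"], ["d", "d"]]
    labels = ["target", "target", "other", "other"]
    model = train_naive_bayes(docs, labels)
    p_text = posterior(model, ["a"], "target")
    print(p_text, fuse(p_text, 0.8))
```

Under this fusion rule, two sources that agree reinforce each other, while an uninformative source (a posterior equal to the prior) leaves the other source's estimate unchanged. Whether the paper's fusion behaves this way cannot be determined from the abstract alone.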