Lu Wei-Zhen, Wang Dong
Department of Building and Construction, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong.
Sci Total Environ. 2008 Jun 1;395(2-3):109-16. doi: 10.1016/j.scitotenv.2008.01.035. Epub 2008 Mar 10.
For ground-level ozone (O₃) prediction, public authorities favor a predictive model that performs reliably not only on non-polluted days but, more importantly, on polluted days, so that alerts can be issued and concerned citizens and industrial organizations can take precautions to avoid exposure and reduce harmful emissions. However, the class imbalance problem, i.e., in some collected field data the number of O₃ polluted days is much smaller than the number of non-polluted days, degrades model performance on the minority class, the O₃ polluted days. Although the support vector machine (SVM) has obtained promising results in air quality prediction, this study proposes a cost-sensitive classification scheme for the standard support vector classification model (S-SVC) in order to investigate whether class imbalance plagues S-SVC. The S-SVC with this scheme is named CS-SVC. Experiments on imbalanced data sets collected from two air quality monitoring sites in Hong Kong show that 1) S-SVC is still sensitive to the class imbalance problem; 2) compared with S-SVC, CS-SVC effectively avoids the class imbalance problem, with a lower percentage of false negatives on O₃ polluted days but a higher percentage of false positives on non-polluted days; 3) the support vector regression model (SVR), after its output is converted to a binary one, performs only similarly to S-SVC, which indicates that the class imbalance problem also impairs the regression model. From the standpoint of protecting public health, CS-SVC, which is less likely to miss O₃ polluted days in its forecasts, is recommended here.
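The idea of a cost-sensitive scheme can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear classifier trained by full-batch subgradient descent on a class-weighted hinge loss, whereas the study's exact formulation, kernel, and cost values are not given in the abstract. The function names, the cost ratio of 5, the definition of the error percentages (misses over polluted days, false alarms over non-polluted days), and the synthetic one-feature data set are all hypothetical choices made for illustration.

```python
def train_weighted_linear_svc(X, y, cost_pos=1.0, cost_neg=1.0,
                              lam=0.01, lr=0.1, epochs=500):
    """Full-batch subgradient descent on a class-weighted hinge loss.

    Labels: +1 = polluted day (minority), -1 = non-polluted day.
    cost_pos and cost_neg scale the hinge penalty of each class
    (assumed cost-sensitive scheme; equal costs give the standard model).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # L2 penalty on the weights only
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:  # sample violates the margin
                cost = cost_pos if yi == 1 else cost_neg
                for j in range(d):
                    grad_w[j] -= cost * yi * xi[j] / n
                grad_b -= cost * yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model, X):
    """Predict +1 where the decision value is positive, else -1."""
    w, b = model
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1
            for xi in X]


def error_rates(y_true, y_pred):
    """Return (false-negative % on polluted days, false-positive % on non-polluted days)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    fn = 100.0 * sum(p == -1 for p in pos) / len(pos)
    fp = 100.0 * sum(p == 1 for p in neg) / len(neg)
    return fn, fp


# Synthetic, hypothetical toy data (not from the study): one feature,
# 20 non-polluted days and 2 polluted days that overlap with 4 of them.
X = [[-1.0]] * 16 + [[1.0]] * 4 + [[1.0]] * 2
y = [-1] * 16 + [-1] * 4 + [1] * 2

standard = train_weighted_linear_svc(X, y)                    # equal costs
cost_sensitive = train_weighted_linear_svc(X, y, cost_pos=5.0)  # misses cost 5x

fn_s, fp_s = error_rates(y, predict(standard, X))
fn_cs, fp_cs = error_rates(y, predict(cost_sensitive, X))
print("standard:       FN %.0f%%  FP %.0f%%" % (fn_s, fp_s))
print("cost-sensitive: FN %.0f%%  FP %.0f%%" % (fn_cs, fp_cs))
```

On this toy data the equal-cost model labels every day as non-polluted and so misses both polluted days, while the cost-weighted model catches them at the price of false alarms on the four overlapping non-polluted days. This mirrors the qualitative trade-off reported above, but the numbers say nothing about the Hong Kong data sets.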