Kurgan Lukasz A, Cios Krzysztof J, Dick Scott
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2V4, Canada.
IEEE Trans Syst Man Cybern B Cybern. 2006 Feb;36(1):32-53. doi: 10.1109/tsmcb.2005.852983.
Business intelligence and bioinformatics applications increasingly require mining datasets that contain millions of data points, or building real-time, enterprise-level decision support systems for large corporations and drug companies. In all such cases, the underlying data mining system must be highly scalable. To this end, we describe a new rule learner called DataSqueezer, which belongs to the family of inductive supervised rule extraction algorithms. DataSqueezer is a simple, greedy rule builder that generates a set of production rules from labeled input data. Despite its relative simplicity, it is a very effective learner: the rules it generates are compact and comprehensible, and their accuracy is comparable to that of rules generated by other state-of-the-art rule extraction algorithms. The main advantages of DataSqueezer are very high efficiency and resistance to missing data. Its asymptotic complexity is log-linear in the number of training examples, and it is faster than other state-of-the-art rule learners. It is also robust to large quantities of missing data, as verified by extensive experimental comparison with the other learners. DataSqueezer is thus well suited to modern data mining and business intelligence tasks, which commonly involve huge datasets with a large fraction of missing data.
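To make the idea of a greedy rule builder concrete, the following is a minimal sketch of a generic greedy covering rule learner over categorical attributes. It is not the DataSqueezer algorithm, whose details the abstract does not give. The function names (`covers`, `learn_rules_for_class`), the representation of a rule as a dictionary, the scoring heuristic, and the choice to encode missing values as `None` and never match on them are all assumptions made for illustration.

```python
from collections import Counter


def covers(rule, example):
    """Return True if every (attribute index -> value) test in the rule holds.

    A missing value (None) in the example never equals a rule value,
    so it simply fails that test (an assumption of this sketch).
    """
    return all(example[i] == v for i, v in rule.items())


def learn_rules_for_class(examples, labels, target):
    """Greedily build rules that cover `target` examples and exclude the rest."""
    pos = [x for x, y in zip(examples, labels) if y == target]
    neg = [x for x, y in zip(examples, labels) if y != target]
    rules = []
    while pos:
        rule = {}
        covered_pos, covered_neg = pos, neg
        # Add one attribute test at a time until no negative example is covered.
        while covered_neg:
            # Count candidate tests seen in covered positives, skipping missing values.
            counts = Counter(
                (i, v)
                for x in covered_pos
                for i, v in enumerate(x)
                if v is not None and i not in rule
            )
            if not counts:
                break  # nothing left to specialize on

            def score(sel):
                i, v = sel
                return counts[sel] - sum(1 for x in covered_neg if x[i] == v)

            i, v = max(counts, key=score)
            rule[i] = v
            covered_pos = [x for x in covered_pos if x[i] == v]
            covered_neg = [x for x in covered_neg if x[i] == v]
        rules.append(rule)
        # Remove the positives this rule explains and continue with the rest.
        pos = [x for x in pos if not covers(rule, x)]
    return rules


# Hypothetical toy data: (outlook, temperature) with some missing values.
examples = [
    ("sunny", "hot"),
    ("sunny", "mild"),
    ("rainy", "hot"),
    ("rainy", None),
    (None, "mild"),
]
labels = ["no", "no", "yes", "yes", "no"]
rules = learn_rules_for_class(examples, labels, "yes")
print(rules)
```

Each resulting dictionary reads as a production rule of the form "IF attribute 0 = rainy THEN class = yes". This naive sketch rescans the covered examples for every added test, so it does not reproduce the log-linear complexity reported above.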