A novel approach for incremental uncertainty rule generation from databases with missing values handling: application to dynamic medical databases.

Authors

Konias Sokratis, Chouvarda Ioanna, Vlahavas Ioannis, Maglaveras Nicos

Affiliation

Laboratory of Medical Informatics, Medical School, Aristotle University of Thessaloniki, Thessaloniki, Greece.

Publication information

Med Inform Internet Med. 2005 Sep;30(3):211-25. doi: 10.1080/14639230500209336.

PMID:16403710

Abstract

Current approaches to mining association rules usually assume that mining is performed on a static database, where the problem of missing attribute values practically does not exist. However, these assumptions do not hold in some medical databases, such as those of a home care system. This paper presents a novel uncertainty rule algorithm, URG-2 (Uncertainty Rule Generator), which addresses the problem of mining dynamic databases that contain missing values. The algorithm requires only one pass over the initial dataset to generate the item set, and it uses new metrics corresponding to the notions of Support and Confidence. URG-2 was evaluated on two medical databases by randomly introducing multiple missing values into the attributes of each record in the initial dataset (rates of 5% to 20%, in 5% increments). Compared with the classical approach, in which records with missing values are ignored, the proposed algorithm was more robust in mining rules from datasets containing missing values. In all cases, the difference in preserving the initial rules ranged between 30% and 60% in favour of URG-2. Moreover, due to its incremental nature, URG-2 saved over 90% of the time required for thorough re-mining. Thus, the proposed algorithm can offer a preferable solution for mining in dynamic relational databases.
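
The abstract does not define URG-2's metrics, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that a record with a missing value is kept rather than discarded, that support for an item set is measured only over the records in which the relevant attributes are known, and that counts are stored so that new records can be added without rescanning old ones. The class name `IncrementalCounter`, the `MISSING` marker and the sample attributes are all hypothetical.

```python
from collections import Counter
from itertools import combinations

MISSING = None  # marker for a missing attribute value (an assumption of this sketch)


class IncrementalCounter:
    """Keeps itemset counts so that new records never force a rescan of old ones."""

    def __init__(self, max_size=2):
        self.max_size = max_size
        self.present = Counter()   # itemset -> number of records containing these values
        self.observed = Counter()  # attribute set -> records where none of them is missing

    def add(self, record):
        """Update the counts with one record (dict: attribute -> value or MISSING)."""
        known = sorted((a, v) for a, v in record.items() if v is not MISSING)
        for size in range(1, self.max_size + 1):
            for items in combinations(known, size):
                self.present[items] += 1
                self.observed[tuple(a for a, _ in items)] += 1

    def support(self, items):
        """Share of records holding these values, among records where the attributes are known."""
        items = tuple(sorted(items))
        seen = self.observed[tuple(a for a, _ in items)]
        return self.present[items] / seen if seen else 0.0

    def confidence(self, antecedent, consequent):
        """Records holding both sides divided by records holding the antecedent.

        Records whose consequent attribute is missing still count in the
        denominator, so this is a conservative estimate.
        """
        both = self.present[tuple(sorted(antecedent + consequent))]
        base = self.present[tuple(sorted(antecedent))]
        return both / base if base else 0.0


# Initial pass over a tiny, made-up dataset with one missing value.
counter = IncrementalCounter()
for rec in [
    {"fever": "yes", "cough": "yes"},
    {"fever": "yes", "cough": MISSING},
    {"fever": "no", "cough": "yes"},
]:
    counter.add(rec)

# A new record arrives later: only the counters are updated, nothing is re-mined.
counter.add({"fever": "yes", "cough": "no"})
print(counter.support([("fever", "yes"), ("cough", "yes")]))
print(counter.confidence([("fever", "yes")], [("cough", "yes")]))
```

Under the classical approach described in the abstract, the second record would be dropped entirely; here it still contributes to the counts for `fever`, which is the kind of information a missing-value-aware method can retain.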
