Synthesizing statistical knowledge from incomplete mixed-mode data.

Author information

Department of Systems Design Engineering, University of Waterloo, Waterloo, Ont. N2L 3G1, Canada.

Publication information

IEEE Trans Pattern Anal Mach Intell. 1987 Jun;9(6):796-805. doi: 10.1109/tpami.1987.4767986.

Abstract

The difficulties in analyzing and clustering (synthesizing) multivariate data of the mixed type (discrete and continuous) are largely due to: 1) nonuniform scaling in different coordinates, 2) the lack of order in nominal data, and 3) the lack of a suitable similarity measure. This paper presents a new approach that bypasses these difficulties and can acquire statistical knowledge from incomplete mixed-mode data. The proposed method adopts an event-covering approach, which covers a subset of statistically relevant outcomes in the outcome space of variable pairs. Once the covered event patterns are acquired, subsequent analysis tasks can be performed, such as probabilistic inference, cluster analysis, and detection of event patterns for each cluster based on the incomplete probability scheme. There are four phases in our method: 1) the discretization of the continuous components based on a maximum entropy criterion, so that the data can be treated as n-tuples of discrete-valued features; 2) the estimation of the missing values using our newly developed inference procedure; 3) the initial formation of clusters by analyzing the nearest-neighbor distance on subsets of selected samples; and 4) the reclassification of the n-tuples into more reliable clusters based on the detected interdependence relationships. For performance evaluation, experiments have been conducted using both simulated and real-life data.
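As an illustration of phase 1 only, the following is a minimal sketch of one common way to discretize a continuous component under a maximum entropy criterion: choosing cut points at equally spaced quantiles, so that each interval holds roughly the same number of observations and the entropy of the interval labels is as large as possible. The abstract does not give the paper's exact procedure, so the function names (`max_entropy_cuts`, `discretize`, `bin_entropy`), the use of `NaN` to mark missing values, and the `-1` label for them are all assumptions made for this example.

```python
import numpy as np


def max_entropy_cuts(values, n_bins):
    """Interior cut points giving roughly equal counts per interval.

    Assumption: missing values are encoded as NaN and are ignored
    when the cut points are estimated.
    """
    values = np.asarray(values, dtype=float)
    observed = values[~np.isnan(values)]
    # Equally spaced quantile levels, excluding 0 and 1.
    levels = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(observed, levels)


def discretize(values, cuts):
    """Map each value to an interval index; missing values get -1."""
    values = np.asarray(values, dtype=float)
    labels = np.searchsorted(cuts, values, side="right")
    return np.where(np.isnan(values), -1, labels)


def bin_entropy(labels):
    """Shannon entropy (in bits) of the non-missing interval labels."""
    observed = labels[labels >= 0]
    counts = np.bincount(observed)
    p = counts[counts > 0] / observed.size
    return float(-(p * np.log2(p)).sum())


if __name__ == "__main__":
    # Hypothetical continuous feature with one missing entry.
    feature = np.array([0.3, 1.2, 5.8, 2.4, np.nan, 9.1, 7.7, 3.3, 4.0])
    cuts = max_entropy_cuts(feature, n_bins=4)
    labels = discretize(feature, cuts)
    print("cut points:", cuts)
    print("labels:", labels)
    print("entropy (bits):", bin_entropy(labels))
```

With the continuous components replaced by interval labels in this way, each sample becomes an n-tuple of discrete values, with missing entries left marked for the later estimation phase.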
