Department of Statistics, University of California Davis, Davis, California, United States of America.
Department of Computer Science, University of California Davis, Davis, California, United States of America.
PLoS One. 2018 Jun 14;13(6):e0198253. doi: 10.1371/journal.pone.0198253. eCollection 2018.
Data generated from a system of interest typically consist of measurements on many covariate features and possibly multiple response features across all subjects in a designated ensemble. Such data are naturally represented by one response-matrix against one covariate-matrix. A matrix lattice is an advantageous platform for simultaneously accommodating heterogeneous data types (continuous, discrete and categorical) and for exploring hidden dependency among and between features and subjects. After each feature is individually renormalized with respect to its own histogram, the categorical version of mutual conditional entropy is evaluated for all pairs of response and covariate features according to combinatorial information theory. Then, by applying Data Cloud Geometry (DCG) algorithmic computations to this mutual conditional entropy matrix, the features are partitioned into multiple synergistic feature-groups. Distinct synergistic feature-groups embrace distinct structures of dependency. The explicit details of dependency among members of a synergistic feature-group are seen through multiscale compositions of blocks computed by a computing paradigm called Data Mechanics. We then propose a categorical pattern matching approach to establish a directed associative linkage: from the patterned response dependency to serial structured covariate dependency. The graphic display of such a directed associative linkage is termed an information flow, and the degrees of association are evaluated via tree-to-tree mutual conditional entropy. This new universal way of discovering system knowledge is illustrated through five data sets. In each case, the emergent visible heterogeneity is an organization of discovered knowledge.
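The pairwise entropy step described above can be illustrated with a minimal Python sketch. The abstract does not state the formula, so the definition used here is an assumption: the mutual conditional entropy of two categorical features is taken as the average of the two conditional entropies, each divided by the marginal entropy of its target feature. The function names (`entropy`, `conditional_entropy`, `mutual_conditional_entropy`, `mce_matrix`) are hypothetical, and the features are assumed to be already categorized, for example by histogram binning.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of categorical labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) = H(given, target) - H(given)."""
    return entropy(list(zip(given, target))) - entropy(given)


def mutual_conditional_entropy(x, y):
    """Assumed definition: average of H(X|Y)/H(X) and H(Y|X)/H(Y).

    Values near 0 suggest strong dependency; values near 1 suggest none.
    """
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        # Assumption: a constant feature carries no information, so report 1.
        return 1.0
    return 0.5 * (conditional_entropy(x, y) / hx + conditional_entropy(y, x) / hy)


def mce_matrix(responses, covariates):
    """Rows index response features, columns index covariate features."""
    return [[mutual_conditional_entropy(r, c) for c in covariates] for r in responses]


if __name__ == "__main__":
    response = ["a", "a", "b", "b"]
    dependent = ["u", "u", "v", "v"]    # determined by the response
    unrelated = ["u", "v", "u", "v"]    # independent of the response
    print(mce_matrix([response], [dependent, unrelated]))
```

A matrix of this kind is what a clustering step such as DCG would then take as input; the clustering itself is not sketched here.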