Zhao Juan, Zhou Yiwei, Zhang Xiujun, Chen Luonan
Key Laboratory of Systems Biology, Innovation Center for Cell Signaling Network, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of the Chinese Academy of Sciences, Shanghai 200031, China;
Key Laboratory of Systems Biology, Innovation Center for Cell Signaling Network, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of the Chinese Academy of Sciences, Shanghai 200031, China; School of Life Science and Technology, ShanghaiTech University, Shanghai 200031, China;
Proc Natl Acad Sci U S A. 2016 May 3;113(18):5130-5. doi: 10.1073/pnas.1522586113. Epub 2016 Apr 18.
Quantitatively identifying direct dependencies between variables is an important task in data analysis, in particular for reconstructing various types of networks and causal relations in science and engineering. One of the most widely used criteria is partial correlation, but it measures only linear direct associations and misses nonlinear ones. Conditional mutual information (CMI), which is based on conditional independence, can quantify nonlinear direct relationships among variables from observed data and is therefore superior to linear measures. However, it suffers from serious underestimation, in particular for variables with tight associations in a network, which severely limits its applications. In this work, we propose a new concept, "partial independence," with a new measure, "part mutual information" (PMI), which not only overcomes this problem of CMI but also retains the quantification properties of both mutual information (MI) and CMI. Specifically, we first define PMI to measure nonlinear direct dependencies between variables and then derive its relations with MI and CMI. Finally, we use a number of simulated datasets as benchmark examples to demonstrate the features of PMI numerically, and then use real gene expression data from Escherichia coli and yeast to reconstruct gene regulatory networks. All of these results validate the advantages of PMI for accurately quantifying nonlinear direct associations in networks.
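The underestimation problem described above can be illustrated with a small discrete example. The sketch below is not taken from the paper: the toy chain Z -> X -> Y, the flip probabilities eps and delta, and all function names are hypothetical choices made for illustration. The abstract does not state the PMI formula, so the form used here (the CMI expression with p(x|z) and p(y|z) replaced by p*(x|z) = sum_y p(x|z,y) p(y) and p*(y|z) = sum_x p(y|z,x) p(x)) is an assumption about the definition and should be checked against the full text.

```python
import numpy as np


def chain_joint(eps=0.01, delta=0.1):
    """Joint table p[x, y, z] for a binary chain Z -> X -> Y (hypothetical toy model).

    X copies Z but flips with probability eps (tight X-Z association);
    Y copies X but flips with probability delta (direct X-Y dependency).
    """
    p = np.zeros((2, 2, 2))
    for x in (0, 1):
        for y in (0, 1):
            for z in (0, 1):
                p_x_given_z = eps if x != z else 1.0 - eps
                p_y_given_x = delta if y != x else 1.0 - delta
                p[x, y, z] = 0.5 * p_x_given_z * p_y_given_x
    return p


def mutual_information(p):
    """MI(X;Y) in bits, computed from the joint table p[x, y, z]."""
    pxy = p.sum(axis=2)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return float(np.sum(pxy * np.log2(pxy / (px * py))))


def conditional_mutual_information(p):
    """CMI(X;Y|Z) in bits: sum of p(x,y,z) * log[p(x,y|z) / (p(x|z) p(y|z))]."""
    pz = p.sum(axis=(0, 1), keepdims=True)
    pxz = p.sum(axis=1, keepdims=True)
    pyz = p.sum(axis=0, keepdims=True)
    return float(np.sum(p * np.log2(p * pz / (pxz * pyz))))


def part_mutual_information(p):
    """PMI(X;Y|Z) in bits, using the ASSUMED form described in the lead-in."""
    pz = p.sum(axis=(0, 1), keepdims=True)
    pxz = p.sum(axis=1, keepdims=True)
    pyz = p.sum(axis=0, keepdims=True)
    px = p.sum(axis=(1, 2), keepdims=True)
    py = p.sum(axis=(0, 2), keepdims=True)
    # p*(x|z) = sum_y p(x|z,y) p(y), with p(x|z,y) = p(x,y,z) / p(y,z)
    px_star = np.sum(p / pyz * py, axis=1, keepdims=True)
    # p*(y|z) = sum_x p(y|z,x) p(x), with p(y|z,x) = p(x,y,z) / p(x,z)
    py_star = np.sum(p / pxz * px, axis=0, keepdims=True)
    return float(np.sum(p * np.log2((p / pz) / (px_star * py_star))))


if __name__ == "__main__":
    p = chain_joint()
    print("MI(X;Y)    =", mutual_information(p))
    print("CMI(X;Y|Z) =", conditional_mutual_information(p))
    print("PMI(X;Y|Z) =", part_mutual_information(p))
```

In this toy setting X and Y are directly dependent, yet because Z almost determines X, CMI(X;Y|Z) comes out as a small fraction of MI(X;Y). Under the assumed definition, PMI equals CMI plus two non-negative Kullback-Leibler terms, so it is never smaller than CMI. The tables here contain no zero entries; real data with empty cells would need smoothing or explicit handling of zero probabilities.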