IEEE/ACM Trans Comput Biol Bioinform. 2020 Jan-Feb;17(1):358-364. doi: 10.1109/TCBB.2018.2879504. Epub 2018 Nov 5.
An important question in microbiology is whether treatment causes changes in gut flora, and whether it also affects metabolism. Reconstructing causal relations purely from non-temporal observational data is challenging. We address the problem of causal inference in the bivariate case, where the joint distribution of two variables is observed, and we consider in particular data on discrete domains. State-of-the-art causal inference methods for continuous data suffer from high computational complexity. Some modern approaches are not suitable for categorical data, and others require estimating and fixing multiple hyper-parameters. In this contribution, we introduce a novel method of causal inference based on the widely used assumption that if X causes Y, then P(X) and P(Y|X) are independent. We propose a semi-supervised approach in which P(Y|X) and P(X) are estimated from labeled and unlabeled data respectively, so that the marginal probability can be estimated from potentially much more (cheap, unlabeled) data than the conditional distribution. We validate the proposed method on the standard cause-effect pairs. Experiments on several benchmarks of biological network reconstruction show that the proposed approach is very competitive with state-of-the-art methods in terms of computational time and accuracy. Finally, we apply the proposed method to an original medical task in which we study whether drugs confound the human metagenome.
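The estimation step described in the abstract, with P(X) taken from plentiful unlabeled data and P(Y|X) from a smaller labeled sample, can be illustrated for discrete variables with a minimal sketch. This is not the authors' implementation. The function names, the additive-smoothing parameter `alpha`, and the toy data are all assumptions made for illustration, and the causal decision rule itself is not shown because the abstract does not specify it.

```python
from collections import Counter


def estimate_marginal(x_unlabeled, x_domain, alpha=1.0):
    """Estimate P(X) from unlabeled observations of X.

    Additive (Laplace) smoothing with `alpha` is an illustrative assumption,
    not something stated in the abstract.
    """
    counts = Counter(x_unlabeled)
    total = len(x_unlabeled) + alpha * len(x_domain)
    return {x: (counts[x] + alpha) / total for x in x_domain}


def estimate_conditional(labeled_pairs, x_domain, y_domain, alpha=1.0):
    """Estimate P(Y|X) from labeled (x, y) pairs, again with additive smoothing."""
    joint = Counter(labeled_pairs)
    x_counts = Counter(x for x, _ in labeled_pairs)
    conditional = {}
    for x in x_domain:
        denom = x_counts[x] + alpha * len(y_domain)
        conditional[x] = {y: (joint[(x, y)] + alpha) / denom for y in y_domain}
    return conditional


# Hypothetical toy data: many cheap unlabeled X values, few labeled (X, Y) pairs.
x_unlabeled = [0, 0, 0, 1, 1, 2] * 50
labeled_pairs = [(0, "a"), (0, "a"), (1, "b"), (2, "b"), (1, "a")]

p_x = estimate_marginal(x_unlabeled, x_domain=[0, 1, 2])
p_y_given_x = estimate_conditional(labeled_pairs, x_domain=[0, 1, 2], y_domain=["a", "b"])
```

In this sketch the marginal is estimated from 300 observations while the conditional uses only 5 labeled pairs, which mirrors the asymmetry in data availability that the semi-supervised setting is meant to exploit.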