Causal discovery using compression-complexity measures.

Affiliation

Consciousness Studies Programme, National Institute of Advanced Studies, Bengaluru, India.

Publication information

J Biomed Inform. 2021 May;117:103724. doi: 10.1016/j.jbi.2021.103724. Epub 2021 Mar 13.

Abstract

Causal inference is one of the most fundamental problems across all domains of science. We address the problem of inferring a causal direction from two observed discrete symbolic sequences X and Y. We present a framework that uses lossless compressors to infer context-free grammars (CFGs) from sequence pairs and quantifies the extent to which the grammar inferred from one sequence compresses the other. We infer that X causes Y if the grammar inferred from X compresses Y better than the grammar inferred from Y compresses X. To put this notion into practice, we propose three models that use two Compression-Complexity Measures (CCMs), Lempel-Ziv (LZ) complexity and Effort-To-Compress (ETC), to infer CFGs and discover causal directions without requiring temporal structure. We evaluate these models on synthetic and real-world benchmarks and empirically observe performance competitive with current state-of-the-art methods. Lastly, we present two unique applications of the proposed models for causal inference directly from pairs of SARS-CoV-2 genome sequences. Using numerous sequences, we show that our models capture causal information exchanged between genome sequence pairs, presenting novel opportunities for sequence analysis to investigate the evolution of virulence and pathogenicity in future applications.
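
The decision rule in the abstract (X causes Y if the grammar inferred from X compresses Y better than the reverse) can be illustrated with a rough sketch. The code below is not one of the paper's three models: it assumes an LZ78-style phrase dictionary as a stand-in for the inferred grammar and a greedy longest-match parse as the compression cost. The function names (lz78_phrases, cross_parse_cost, infer_direction) and the length normalization are hypothetical choices made for illustration only.

```python
def lz78_phrases(seq):
    """Collect the phrases produced by LZ78-style incremental parsing of seq.

    Assumption: this phrase set stands in for the "grammar" inferred from seq.
    """
    phrases = set()
    current = ""
    for symbol in seq:
        current += symbol
        if current not in phrases:
            # A phrase not seen before closes the current parse step.
            phrases.add(current)
            current = ""
    return phrases


def cross_parse_cost(dictionary, target):
    """Count the phrases needed to cover target using dictionary.

    Greedy longest match: at each position take the longest dictionary
    phrase of length >= 2 that matches; otherwise emit a single symbol.
    Each emitted phrase or symbol costs 1.
    """
    max_len = max((len(p) for p in dictionary), default=1)
    i = 0
    cost = 0
    while i < len(target):
        step = 1
        for length in range(min(max_len, len(target) - i), 1, -1):
            if target[i:i + length] in dictionary:
                step = length
                break
        i += step
        cost += 1
    return cost


def infer_direction(x, y):
    """Return "X -> Y", "Y -> X", or "undecided" (illustrative rule only).

    Costs are divided by the target length so that sequences of different
    lengths can be compared; this normalization is an assumption.
    """
    if not x or not y:
        raise ValueError("both sequences must be non-empty")
    cost_x_to_y = cross_parse_cost(lz78_phrases(x), y) / len(y)
    cost_y_to_x = cross_parse_cost(lz78_phrases(y), x) / len(x)
    if cost_x_to_y < cost_y_to_x:
        return "X -> Y"
    if cost_y_to_x < cost_x_to_y:
        return "Y -> X"
    return "undecided"


if __name__ == "__main__":
    # Toy symbolic sequences; real inputs would be, e.g., genome strings.
    print(infer_direction("ACGTACGTACGT", "ACGTTTACGTAC"))
```

A lower normalized cost means the dictionary built from one sequence covers the other in fewer phrases. Ties are reported as "undecided" rather than forced into a direction.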
