Grassi Mario, Tarantino Barbara
Department of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy.
PLoS One. 2025 Jan 8;20(1):e0317283. doi: 10.1371/journal.pone.0317283. eCollection 2025.
A Directed Acyclic Graph (DAG) offers a simple way to define causal structure among a set of nodes: causal links are represented by arrows between variables, pointing from cause to effect. Learning DAG structure from observational data has recently attracted close attention in both industry and academia, and many techniques have been proposed to address the problem. We provide a two-step approach, named SEMdag(), for quickly learning high-dimensional linear structural equation models (SEMs). It is included in the R package SEMgraph and performs a two-stage order-based search using either prior knowledge (knowledge-based, KB) or a data-driven method (bottom-up, BU), under the assumption of a linear SEM with equal-variance error terms. We evaluated our framework's ability to find plausible DAGs against six well-known causal discovery techniques (ARGES, GES, PC, LiNGAM, CAM, NOTEARS). We conducted a series of experiments on observed expression (RNA-seq) data, using a pair of training and testing datasets for each of four diseases: Amyotrophic Lateral Sclerosis (ALS), Breast cancer (BRCA), Coronavirus disease (COVID-19) and ST-elevation myocardial infarction (STEMI). The results show that the SEMdag() procedure can recover a graph structure with good disease prediction performance, as evaluated by a conventional supervised learning algorithm (random forest, RF): when the initial graph is sparse, the BU approach may be a better choice than the KB one; when the graph is denser, both BU and KB report high performance, with the highest score for the KB approach based on topological layers. Besides its superior disease prediction performance compared to previous research, SEMdag() gives the user the flexibility to choose among distinct structure learning algorithms and can handle high-dimensional problems with a lower computational load. The SEMdag() function is implemented in the R package SEMgraph, available at https://CRAN.R-project.org/package=SEMgraph.
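To make the idea of a two-stage, order-based search under the equal-variance assumption more concrete, the following is a minimal Python sketch; it is not the SEMdag() implementation and does not use the SEMgraph API. It assumes a known population covariance matrix rather than sampled data, recovers a node ordering bottom-up by repeatedly removing the node with the smallest diagonal entry of the precision matrix (a sink, when all error variances are equal), and then picks parents by regressing each node on its predecessors. The function names `bottom_up_order` and `select_parents`, the threshold `tol`, and the four-node example graph are all hypothetical, and plain least squares with a threshold stands in for the penalized regression that finite-sample data would normally require.

```python
import numpy as np


def bottom_up_order(cov):
    """Stage 1 (sketch): topological order, sources first, assuming equal error variances."""
    remaining = list(range(cov.shape[0]))
    sinks = []
    while remaining:
        sub = cov[np.ix_(remaining, remaining)]
        precision = np.linalg.inv(sub)
        # With equal error variances, a sink of the remaining subgraph has the
        # smallest diagonal entry of the precision matrix.
        k = int(np.argmin(np.diag(precision)))
        sinks.append(remaining.pop(k))
    return sinks[::-1]


def select_parents(cov, order, tol=1e-8):
    """Stage 2 (sketch): regress each node on its predecessors, keep nonzero coefficients."""
    p = cov.shape[0]
    B = np.zeros((p, p))  # B[i, j] != 0 encodes the edge i -> j
    for pos, j in enumerate(order):
        preds = order[:pos]
        if not preds:
            continue
        # Population least-squares coefficients of node j on its predecessors.
        coef = np.linalg.solve(cov[np.ix_(preds, preds)], cov[preds, j])
        for i, c in zip(preds, coef):
            if abs(c) > tol:
                B[i, j] = c
    return B


# Hypothetical example DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, unit error variances.
B_true = np.zeros((4, 4))
B_true[0, 1] = 0.8
B_true[0, 2] = -0.5
B_true[1, 3] = 0.7
B_true[2, 3] = 0.6

# For X = B^T X + e with Cov(e) = I, the covariance is (I - B^T)^{-1} (I - B)^{-1}.
A = np.linalg.inv(np.eye(4) - B_true.T)
cov = A @ A.T

order = bottom_up_order(cov)
B_hat = select_parents(cov, order)
print("order:", order)
print("recovered weights:\n", np.round(B_hat, 3))
```

Nodes 1 and 2 are both sinks once node 3 is removed, so either may come first in the recovered order; both orderings are valid and lead to the same edge set. With real expression data the covariance would be estimated from samples, and the exact thresholding in the second stage would need to be replaced by a regularized or tested selection step.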