Department of Brain and Behavioral Sciences, University of Pavia, Pavia 27100, Italy.
Bioinformatics. 2023 Jun 1;39(6). doi: 10.1093/bioinformatics/btad377.
With the exponential growth of expression and protein-protein interaction (PPI) data, identifying functional modules in PPI networks that show striking changes in molecular activity or phenotypic signatures is of particular interest for revealing process-specific information correlated with cellular or disease states. This requires both identifying network nodes with reliability scores and an efficient technique for locating the network regions with the highest scores. A number of heuristic methods have been suggested in the literature. We propose SEMtree(), a set of tree-based structure discovery algorithms that combine graph and statistically interpretable parameters, together with a user-friendly R package based on the structural equation modeling framework.
Condition-specific changes in differential expression and gene-gene co-expression are recovered by statistically testing differences between groups in nodes, directed edges, and directed paths. Finally, from a list of seed (i.e. disease) genes or gene P-values, perturbed modules with undirected edges are generated using five state-of-the-art active subnetwork detection methods. These modules are then supplied to causal additive trees based on Chu-Liu-Edmonds' algorithm (Chow and Liu, Approximating discrete probability distributions with dependence trees. IEEE Trans Inform Theory 1968;14:462-7) in SEMtree() and converted into directed trees. This conversion allows the methods to be compared in terms of directed active subnetworks. We applied SEMtree() to both a coronavirus disease (COVID-19) RNA-seq dataset (GEO accession: GSE172114) and simulated datasets with various differential expression patterns. Compared with existing methods, SEMtree() captures biologically relevant subnetworks with simple visualization of directed paths, good perturbation extraction, and good classifier performance.
The SEMtree() function is implemented in the R package SEMgraph, available at https://CRAN.R-project.org/package=SEMgraph.
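The step of converting an undirected module into a directed tree can be illustrated with a minimal sketch. It is not the SEMtree() implementation and does not reproduce the causal additive tree scoring or Chu-Liu-Edmonds' algorithm; it uses the simpler Chow-Liu-style idea of taking a maximum-weight spanning tree and orienting its edges away from a chosen root. The function names, node names, and edge weights below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm, taking edges in order of decreasing weight."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so no cycle
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


def orient_from_root(tree_edges, root):
    """Direct every tree edge away from `root` with a breadth-first walk."""
    adjacency = defaultdict(list)
    for u, v, _ in tree_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    directed, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                directed.append((node, neighbour))
                queue.append(neighbour)
    return directed


# Hypothetical undirected module: weights stand in for an association
# measure between genes (made-up values, not real data).
nodes = ["G1", "G2", "G3", "G4"]
edges = [
    ("G1", "G2", 0.9),
    ("G2", "G3", 0.8),
    ("G1", "G3", 0.3),
    ("G3", "G4", 0.7),
    ("G2", "G4", 0.2),
]

tree = maximum_spanning_tree(nodes, edges)
directed_tree = orient_from_root(tree, root="G1")
print(directed_tree)
```

In this sketch the choice of root fixes every edge direction, which is why a data-driven orientation criterion, such as the one the abstract attributes to causal additive trees, is needed in practice.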