SEMbap: Bow-free covariance search and data de-correlation.

Affiliation

Department of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy.

Publication information

PLoS Comput Biol. 2024 Sep 11;20(9):e1012448. doi: 10.1371/journal.pcbi.1012448. eCollection 2024 Sep.

Abstract

Large-scale studies of gene expression are commonly influenced by biological and technical sources of expression variation, including batch effects, sample characteristics, and environmental impacts. Learning the causal relationships between observable variables can be challenging in the presence of unobserved confounders, and many high-dimensional regression techniques may perform worse in this setting. Controlling for unobserved confounding variables is therefore essential, and many deconfounding methods have been suggested for a variety of situations. The main contribution of this article is a two-stage deconfounding procedure, called SEMbap(), based on Bow-free Acyclic Paths (BAP) search and developed within the framework of Structural Equation Models (SEM). In the first stage, an exhaustive search for missing edges with significant covariance is performed via Shipley d-separation tests; in the second stage, either a Constrained Gaussian Graphical Model (CGGM) is fitted or a low-dimensional representation of the bow-free edge structure is obtained via Graph Laplacian Principal Component Analysis (gLPCA). We compare four popular deconfounding methods with the BAP search approach on simulated and observed expression data. In the simulations, different structures of the hidden covariance matrix were replicated. Compared with existing methods, the BAP search algorithm correctly identifies hidden confounding while controlling the false positive rate and achieving good fitting and perturbation metrics.
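
To make the first stage more concrete, the following is a minimal Python sketch of a Shipley-style search for missing edges with significant covariance. It is not the SEMbap() implementation (which is an R function whose internals are not described in this abstract). The helper names `partial_corr`, `fisher_z_pvalue`, and `missing_edge_tests`, the use of a Fisher z-test on partial correlations, and the toy graph with one hidden confounder are all assumptions made for illustration.

```python
import math
import numpy as np

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in cond."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    return -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])

def fisher_z_pvalue(r, n, k):
    """Two-sided Fisher z p-value for a partial correlation r
    estimated from n samples with k conditioning variables."""
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - k - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def missing_edge_tests(data, parents):
    """Test every non-adjacent pair (i, j) of the assumed DAG,
    conditioning on the union of their parents (Shipley basis set).
    `parents` maps each node index to the set of its parents."""
    n, p = data.shape
    pvalues = {}
    for j in range(p):
        for i in range(j):
            if i in parents[j] or j in parents[i]:
                continue  # adjacent in the DAG: not a missing edge
            cond = sorted((parents[i] | parents[j]) - {i, j})
            r = partial_corr(data, i, j, cond)
            pvalues[(i, j)] = fisher_z_pvalue(r, n, len(cond))
    return pvalues

# Toy data (hypothetical): assumed DAG 0 -> 1, 0 -> 2, 0 -> 3,
# plus a hidden confounder u acting on nodes 1 and 2 only.
rng = np.random.default_rng(0)
n = 2000
u = rng.normal(size=n)                      # unobserved confounder
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + u + rng.normal(size=n)
x2 = 0.8 * x0 + u + rng.normal(size=n)
x3 = 0.8 * x0 + rng.normal(size=n)
data = np.column_stack([x0, x1, x2, x3])
parents = {0: set(), 1: {0}, 2: {0}, 3: {0}}

pvals = missing_edge_tests(data, parents)
for pair, pval in sorted(pvals.items()):
    print(pair, f"{pval:.3g}")
```

In this toy setting the pair (1, 2) shares the hidden confounder, so its d-separation test should be strongly rejected, which flags a candidate covariance between the two nodes. A practical analysis would also need a multiple-testing correction across all tested pairs, which this sketch omits.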
