Yang Jianing, Grant Gregory R, Brooks Thomas G
Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
Chronobiology and Sleep Institute, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
PLoS Comput Biol. 2025 Sep 5;21(9):e1013392. doi: 10.1371/journal.pcbi.1013392. eCollection 2025 Sep.
Simulation of realistic omics data is a key input for benchmarking studies that help users choose optimal computational pipelines. Omics data comprise large numbers of measured features per sample, and these measurements are generally correlated with each other. However, simulations too often ignore these correlations, perhaps because of the computational and statistical hurdles involved in modeling them. To address this, we describe three approaches for generating omics-scale data with correlated measures that mimic real datasets. All three are based on a Gaussian copula whose covariance matrix decomposes into a diagonal part and a low-rank part. This decomposition allows for extremely efficient simulation, overcoming a hurdle to the adoption of past methods. We use these approaches to demonstrate the importance of including correlation in two benchmarking applications. First, we show that the variance of results from the popular DESeq2 method increases when dependence is included. Second, we demonstrate that CYCLOPS, a method for inferring circadian time of collection from transcriptomics, performs better in some circumstances when gene-gene dependencies are included. We provide an R package, dependentsimr, that implements these methods efficiently and can generate dependent data with arbitrary marginal distributions, including discrete (binary, ordered categorical, Poisson, negative binomial), continuous (normal), or empirical distributions.
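The core idea in the abstract, a Gaussian copula whose covariance is a diagonal part plus a low-rank part, can be illustrated with a minimal sketch. This is not the dependentsimr implementation and does not use its API: the function names, the toy values for the diagonal `diag`, the loading matrix `loadings`, and the Poisson means `lams` are all assumptions chosen for illustration, and the sketch does not show how those quantities would be estimated from a real dataset. It only shows why the decomposition is cheap: a latent vector with covariance D + U Uᵀ can be drawn from p + k independent normals, with no p × p matrix ever formed.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def poisson_quantile(u, lam):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam).

    Illustrative only: exp(-lam) underflows for very large lam.
    """
    u = min(u, 1.0 - 1e-12)  # guard against u rounding to exactly 1.0
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and k < 10_000:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def simulate_sample(diag, loadings, lams, rng):
    """Draw one sample of p dependent Poisson features.

    Latent covariance is D + U U^T with D = diag(diag) and U = loadings
    (p rows, k columns). All argument names are hypothetical.
    """
    k = len(loadings[0])
    # k shared factors induce the dependence between features
    w = [rng.gauss(0.0, 1.0) for _ in range(k)]
    sample = []
    for d_j, u_j, lam_j in zip(diag, loadings, lams):
        # z_j = sqrt(d_j) * e_j + sum_a U[j][a] * w[a]; cost is O(k) per feature
        z = math.sqrt(d_j) * rng.gauss(0.0, 1.0)
        z += sum(l * f for l, f in zip(u_j, w))
        # rescale to unit variance so the normal CDF gives a uniform marginal
        sd = math.sqrt(d_j + sum(l * l for l in u_j))
        u = STD_NORMAL.cdf(z / sd)
        # map the uniform to the desired marginal (Poisson here)
        sample.append(poisson_quantile(u, lam_j))
    return sample


# Toy example: features 0 and 1 share a factor, feature 2 is independent.
rng = random.Random(0)
diag = [0.19, 0.19, 1.0]
loadings = [[0.9], [0.9], [0.0]]
lams = [5.0, 5.0, 2.0]
samples = [simulate_sample(diag, loadings, lams, rng) for _ in range(2000)]
```

In this toy setup the first two features are strongly correlated while each still follows its own Poisson marginal; replacing `poisson_quantile` with another quantile function would give a different marginal distribution under the same dependence structure.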