IMAG, Université de Montpellier, CNRS, Montpellier, France.
Friedrich Miescher Institute for Biomedical Research, 4058 Basel, Switzerland.
Mol Biol Evol. 2023 Jan 4;40(1). doi: 10.1093/molbev/msac269.
Interspecies RNA-Seq datasets are increasingly common and have the potential to answer new questions about the evolution of gene expression. Single-species differential expression analysis is now a well-studied problem that benefits from sound statistical methods. Extensive reviews on biological or synthetic datasets have given the community a clear picture of the relative performance of the available methods in various settings. However, tools for simulating synthetic datasets are still missing in the interspecies gene expression context. In this work, we develop and implement a new simulation framework. This tool builds on both the RNA-Seq and the phylogenetic comparative methods literature to generate realistic count datasets while taking into account the phylogenetic relationships between the samples. We illustrate the usefulness of this new framework through a targeted simulation study that reproduces the features of a recently published dataset containing gene expression data in adult eye tissue across blind and sighted freshwater crayfish species. Using our simulated datasets, we perform a fair comparison of several approaches used for differential expression analysis. This benchmark reveals some of the strengths and weaknesses of both the classical and phylogenetic approaches for interspecies differential expression analysis, and allows for a reanalysis of the crayfish dataset. The tool has been integrated into the R package compcodeR, freely available on Bioconductor.
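To make the idea of phylogeny-aware count simulation concrete, here is a minimal Python sketch. It is not the compcodeR implementation and does not use its API. It assumes a toy model: log-expression evolves by Brownian motion along a hypothetical four-species tree, and counts are then drawn from a Poisson distribution. The tree, the function name `simulate_counts`, all parameter values, and the blind/sighted-style `condition` vector are illustrative assumptions.

```python
import numpy as np

# Hypothetical ultrametric tree ((A:0.3,B:0.3):0.7,(C:0.5,D:0.5):0.5), height 1.
# Under Brownian motion, Cov(i, j) is proportional to the branch length
# shared by species i and j on their paths from the root.
SPECIES = ["A", "B", "C", "D"]
SHARED = np.array([
    [1.0, 0.7, 0.0, 0.0],
    [0.7, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])


def simulate_counts(n_genes, n_de=0, base_log_mean=4.0, sigma2=0.25,
                    log_fold_change=1.0, condition=(0, 0, 1, 1), seed=0):
    """Return an (n_genes, n_species) integer matrix of simulated counts.

    The first `n_de` genes receive a shift of `log_fold_change` in the
    species where `condition` is 1 (an assumed two-group design).
    """
    rng = np.random.default_rng(seed)
    condition = np.asarray(condition, dtype=float)

    # Phylogenetically correlated noise: rows are genes, columns are species.
    chol = np.linalg.cholesky(sigma2 * SHARED)
    noise = rng.standard_normal((n_genes, len(SPECIES))) @ chol.T

    # Log-scale expected expression, with a shift for the DE genes.
    log_mean = base_log_mean + noise
    log_mean[:n_de] += log_fold_change * condition

    # Poisson sampling on top of the log-normal means (toy count model).
    return rng.poisson(np.exp(log_mean))


if __name__ == "__main__":
    counts = simulate_counts(n_genes=5, n_de=2, seed=42)
    print(SPECIES)
    print(counts)
```

In this sketch, closely related species (A and B) tend to have more similar counts than distant ones (A and C) even for genes with no differential expression, which is the kind of structure a method ignoring the phylogeny could mistake for signal.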