Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway.
Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway.
Stat Med. 2022 Oct 15;41(23):4532-4553. doi: 10.1002/sim.9524. Epub 2022 Jul 18.
Variable selection is crucial in high-dimensional omics-based analyses, since it is biologically reasonable to assume only a subset of non-noisy features contributes to the data structures. However, the task is particularly hard in an unsupervised setting, and a priori ad hoc variable selection is still a very frequent approach, despite the evident drawbacks and lack of reproducibility. We propose a Bayesian variable selection approach for rank-based unsupervised transcriptomic analysis. Making use of data rankings instead of the actual continuous measurements increases the robustness of conclusions when compared to classical statistical methods, and embedding variable selection into the inferential tasks allows complete reproducibility. Specifically, we develop a novel extension of the Bayesian Mallows model for variable selection that allows for a full probabilistic analysis, leading to coherent quantification of uncertainties. Simulation studies demonstrate the versatility and robustness of the proposed method in a variety of scenarios, as well as its superiority with respect to several competitors when varying the data dimension or data generating process. We use the novel approach to analyze genome-wide RNAseq gene expression data from ovarian cancer patients: several genes that affect cancer development are correctly detected in a completely unsupervised fashion, showing the usefulness of the method in the context of signature discovery for cancer genomics. Moreover, the possibility to also perform uncertainty quantification plays a key role in the subsequent biological investigation.
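As a rough illustration of the rank-based idea described in the abstract, the sketch below converts continuous expression values to within-sample rankings and evaluates the unnormalized log-probability of a Mallows model with the footrule distance. It assumes the common Bayesian Mallows parameterization, P(R | alpha, rho) proportional to exp(-(alpha/n) d(R, rho)). The toy data, the function names, and the choice of distance are illustrative assumptions, not the authors' implementation. The variable selection extension proposed in the paper is not reproduced here.

```python
def to_ranks(values):
    """Rank items within one sample: rank 1 is the highest value; ties broken by index."""
    order = sorted(range(len(values)), key=lambda i: (-values[i], i))
    ranks = [0] * len(values)
    for position, item in enumerate(order, start=1):
        ranks[item] = position
    return ranks


def footrule(r, rho):
    """Footrule distance: sum of absolute rank differences between two rankings."""
    return sum(abs(a - b) for a, b in zip(r, rho))


def mallows_log_kernel(rankings, rho, alpha):
    """Unnormalized Mallows log-probability of all rankings given consensus rho.

    The normalizing constant is omitted, so values are only comparable
    across different rho at a fixed alpha.
    """
    n = len(rho)
    return -(alpha / n) * sum(footrule(r, rho) for r in rankings)


# Hypothetical toy data: 3 samples (rows) by 4 genes (columns).
expression = [
    [5.1, 2.3, 9.7, 0.4],
    [4.8, 2.9, 8.8, 1.1],
    [6.0, 1.2, 7.5, 3.3],
]
rankings = [to_ranks(sample) for sample in expression]

# Candidate consensus ranking of the 4 genes (an assumption for illustration).
consensus = [2, 3, 1, 4]
log_kernel = mallows_log_kernel(rankings, consensus, alpha=2.0)
```

Because only the ordering of the values within each sample enters the model, any monotone transformation of the expression measurements leaves the rankings, and hence the kernel, unchanged. This is one way to read the robustness claim above.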