Valle Filippo, Caselle Michele, Osella Matteo
Physics Department, University of Turin and INFN, Via Pietro Giuria 1, 10125 Torino, Italy.
NAR Genom Bioinform. 2025 Apr 22;7(2):lqaf049. doi: 10.1093/nargab/lqaf049. eCollection 2025 Jun.
The availability of high-dimensional transcriptomic datasets is increasing at a tremendous pace, together with the need for suitable computational tools. Clustering and dimensionality reduction are the go-to approaches for identifying basic structures in these datasets. At the same time, different topic modeling techniques have been developed to organize the deluge of available natural-language data using its latent topical structure. This paper leverages the statistical analogies between text and transcriptomic datasets to compare different topic modeling methods when applied to gene expression data. Specifically, we test their accuracy in discovering and reconstructing the tissue structure of the human transcriptome and in distinguishing healthy from cancerous tissues. We examine the properties of the latent space recovered by different methods and highlight their differences and their pros and cons across tasks. We focus in particular on how different statistical priors affect the results and their interpretability. Finally, we show that the latent topic space can serve as a useful low-dimensional embedding space, in which a basic neural network classifier can annotate transcriptomic profiles with high accuracy.
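The text-transcriptome analogy in the abstract treats each sample as a document, each gene as a word, and expression counts as word counts. As a minimal sketch of that idea, the Python block below fits probabilistic latent semantic analysis (pLSA) by expectation-maximization on a tiny hand-made count matrix and then labels a sample in the resulting topic space. Everything here is an assumption made for illustration: pLSA is a deliberately simple stand-in and is not claimed to be one of the methods the paper compares, the nearest-centroid rule replaces the neural network classifier mentioned in the abstract, and the count matrix, labels, and function names are hypothetical.

```python
import math
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM (illustrative stand-in for a topic model).

    counts[d][w] is the count of gene w in sample d.
    Returns theta[d][k] = P(topic k | sample d),
            phi[k][w]   = P(gene w | topic k),
            and the log-likelihood at the start of each iteration.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    # Random positive initialization, normalized to probability vectors.
    theta = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    phi = [normalize([rng.random() + 0.1 for _ in range(n_words)])
           for _ in range(n_topics)]
    history = []
    for _ in range(n_iter):
        # Tiny pseudo-counts keep every probability strictly positive.
        new_theta = [[1e-12] * n_topics for _ in range(n_docs)]
        new_phi = [[1e-12] * n_words for _ in range(n_topics)]
        loglik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                c = counts[d][w]
                if c == 0:
                    continue
                # E-step: responsibility of each topic for the pair (d, w).
                joint = [theta[d][k] * phi[k][w] for k in range(n_topics)]
                total = sum(joint)
                loglik += c * math.log(total)
                for k in range(n_topics):
                    r = c * joint[k] / total
                    new_theta[d][k] += r
                    new_phi[k][w] += r
        # M-step: renormalize the expected counts.
        theta = [normalize(row) for row in new_theta]
        phi = [normalize(row) for row in new_phi]
        history.append(loglik)
    return theta, phi, history


def nearest_centroid(train_theta, train_labels, query):
    """Label a topic-space vector by its closest class centroid.

    A simple substitute for the neural network classifier in the abstract.
    """
    groups = {}
    for vec, lab in zip(train_theta, train_labels):
        groups.setdefault(lab, []).append(vec)
    best_label, best_dist = None, float("inf")
    for lab, vecs in groups.items():
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, query))
        if dist < best_dist:
            best_label, best_dist = lab, dist
    return best_label


# Hypothetical toy data: 4 samples x 6 genes, two made-up tissue labels.
counts = [
    [30, 25, 28, 1, 0, 2],
    [27, 31, 24, 0, 1, 1],
    [2, 1, 0, 29, 26, 30],
    [0, 2, 1, 25, 32, 27],
]
labels = ["tissue_A", "tissue_A", "tissue_B", "tissue_B"]

theta, phi, history = plsa(counts, n_topics=2)
for lab, row in zip(labels, theta):
    print(lab, [round(p, 3) for p in row])
print("predicted label for sample 0:", nearest_centroid(theta, labels, theta[0]))
```

Each row of `theta` is the low-dimensional embedding of one sample, which is the representation a downstream classifier would consume. Real transcriptomic data would call for a maintained topic modeling library, a held-out test set, and a principled choice of the number of topics, none of which this sketch attempts.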