Department of Statistical Sciences, University of Padova, Padova 35121, Italy.
RENCI, University of North Carolina, Chapel Hill, NC 27517, USA.
Bioinformatics. 2020 Jun 1;36(11):3522-3527. doi: 10.1093/bioinformatics/btaa189.
Low-dimensional representations of high-dimensional data are routinely employed in biomedical research to visualize, interpret and communicate results from different pipelines. Without correction, batch effects can obscure interesting structure in the data. In this article, we propose a novel procedure to directly estimate t-SNE embeddings that are not driven by batch effects. The proposed algorithm can therefore significantly aid visualization of high-dimensional data.
The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings. Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings, while retaining fundamental information on cell types. When applied to single-cell gene expression data to investigate mouse medulloblastoma, the proposed method successfully removes batch effects related to mouse identifiers and the date of the experiment, while preserving clusters of oligodendrocytes, astrocytes, endothelial cells and microglia, which are expected to lie in the stroma within or adjacent to the tumours.
Source code implementing the proposed approach is available as an R package at https://github.com/emanuelealiverti/BC_tSNE, including a tutorial to reproduce the simulation studies.
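To make the linear-algebra idea mentioned above concrete, the following is a minimal sketch of one generic way to remove a batch signal from a data matrix: regress every feature on one-hot batch indicators and keep the residuals. It is an illustration only and is not the BC_tSNE algorithm or the API of the R package; the function name `remove_linear_batch_effect` and the simulated data are assumptions introduced for this example.

```python
import numpy as np


def remove_linear_batch_effect(X, batch):
    """Return X with the part linearly explained by batch membership removed.

    Hypothetical helper for illustration; not part of the BC_tSNE package.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch)
    # One-hot design matrix of shape (n_samples, n_batches).
    Z = (batch[:, None] == labels[None, :]).astype(float)
    # Least-squares fit of every feature on the batch indicators.
    coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
    # Residuals are orthogonal to each batch indicator, so every
    # per-batch mean (and hence the overall mean) becomes zero.
    return X - Z @ coef


# Simulated data (an assumption): 60 samples, 5 features, 3 batches,
# each batch shifted by a different constant.
rng = np.random.default_rng(0)
n, p = 60, 5
batch = np.repeat(np.array([0, 1, 2]), n // 3)
shift = np.array([0.0, 3.0, -2.0])[batch][:, None]
X = rng.normal(size=(n, p)) + shift

X_adj = remove_linear_batch_effect(X, batch)
```

The adjusted matrix `X_adj` could then be passed to any standard t-SNE implementation. Note that this two-step approach differs from the procedure described above, which estimates the embedding directly under constraints rather than correcting the data first.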