Kim Hyun, Chang Won, Chae Seok Joo, Park Jong-Eun, Seo Minseok, Kim Jae Kyoung
Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea.
Division of Statistics and Data Science, University of Cincinnati, Cincinnati, OH, 45221, USA.
Nat Commun. 2024 Apr 27;15(1):3575. doi: 10.1038/s41467-024-47884-3.
High dimensionality and noise have limited the new biological insights that can be discovered in scRNA-seq data. While dimensionality reduction tools have been developed to extract biological signals from the data, they often require manual determination of the signal dimension, introducing user bias. Furthermore, a common data preprocessing method, log normalization, can unintentionally distort signals in the data. Here, we develop scLENS, a dimensionality reduction tool that circumvents the long-standing issues of signal distortion and manual input. Specifically, we identify the primary cause of signal distortion during log normalization and effectively address it by uniformizing cell vector lengths with L2 normalization. Furthermore, we utilize random matrix theory-based noise filtering and a signal robustness test to enable data-driven determination of the threshold for signal dimensions. Our method outperforms 11 widely used dimensionality reduction tools and performs particularly well for challenging scRNA-seq datasets with high sparsity and variability. To facilitate the use of scLENS, we provide a user-friendly package that automates accurate signal detection in scRNA-seq data without time-consuming manual tuning.
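The two ideas named in the abstract, L2 normalization of cell vectors after log normalization and random matrix theory-based noise filtering, can be illustrated with a minimal sketch. This is not the scLENS implementation. The function names, the scale factor of 1e4, and the use of the Marchenko-Pastur upper edge as the only cutoff are assumptions made for illustration, and the signal robustness test mentioned in the abstract is not reproduced here.

```python
import numpy as np


def log_l2_normalize(counts, scale=1e4):
    """Library-size scaling, log1p, then rescaling each cell (row) to unit L2 length.

    `scale` is an assumed, conventional value; it is not taken from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0  # avoid division by zero for empty cells
    logged = np.log1p(counts / lib * scale)
    norms = np.linalg.norm(logged, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return logged / norms  # every non-empty cell vector now has length 1


def count_signal_dims(x):
    """Count covariance eigenvalues above the Marchenko-Pastur upper edge.

    Genes (columns) are standardized so that pure noise would have unit
    variance, which is the setting in which the edge (1 + sqrt(p/n))^2 applies.
    """
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)
    keep = std > 0  # drop constant genes, which cannot be standardized
    z = (x[:, keep] - x[:, keep].mean(axis=0)) / std[keep]
    n_cells, n_genes = z.shape
    # Nonzero eigenvalues of z.T @ z / n_cells, obtained from singular values.
    eigvals = np.linalg.svd(z, compute_uv=False) ** 2 / n_cells
    mp_upper = (1.0 + np.sqrt(n_genes / n_cells)) ** 2
    return int(np.sum(eigvals > mp_upper))


# Synthetic example (hypothetical data): 100 cells x 200 genes, where the
# first 50 cells over-express the first 50 genes.
rng = np.random.default_rng(0)
rates = np.ones((100, 200))
rates[:50, :50] = 5.0
counts = rng.poisson(rates)

normalized = log_l2_normalize(counts)
n_signal = count_signal_dims(normalized)
print("eigenvalues above the Marchenko-Pastur edge:", n_signal)
```

In this sketch the number of retained dimensions comes from the data rather than from a user-chosen cutoff, which is the motivation the abstract gives. In finite samples, eigenvalues of pure noise can fluctuate slightly above the asymptotic edge, so a bare threshold like this one can overcount.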