Intrinsic entropy model for feature selection of scRNA-seq data.

Affiliations

State Key Laboratory of Cell Biology, Shanghai Institute of Biochemistry and Cell Biology, CAS Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China.

University of Chinese Academy of Sciences, Beijing 100049, China.

Publication information

J Mol Cell Biol. 2022 Jun 8;14(2). doi: 10.1093/jmcb/mjac008.

Abstract

Recent advances in single-cell RNA sequencing (scRNA-seq) technologies have led to extensive study of cellular heterogeneity and cell-to-cell variation. However, the high frequency of dropout events and noise in scRNA-seq data confounds the accuracy of downstream analysis, i.e. clustering analysis, whose accuracy depends heavily on the selected feature genes. Here, by deriving an entropy decomposition formula, we propose a feature selection method, i.e. an intrinsic entropy (IE) model, to identify the informative genes for accurate clustering analysis. Specifically, by eliminating the 'noisy' fluctuation or extrinsic entropy (EE), we extract the IE of each gene from the total entropy (TE), i.e. TE = IE + EE. We show that the IE of each gene actually reflects the regulatory fluctuation of this gene in a cellular process, and thus high-IE genes provide rich information for cell type or state analysis. To validate the performance of the high-IE genes, we conduct computational analysis on both simulated datasets and real single-cell datasets, comparing the IE model with other representative methods. The results show that our IE model is not only broadly applicable and robust across different clustering and classification methods, but also sensitive to novel cell types. Our results also demonstrate that the intrinsic entropy/fluctuation of a gene serves as information rather than noise, in contrast to its total entropy/fluctuation.
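
The abstract states the decomposition TE = IE + EE but does not give the formula, so the sketch below is not the authors' IE model. It only illustrates, under stated assumptions, how an entropy can split additively into an informative part and a residual part, using the standard identity H(X) = I(X;C) + H(X|C). The group labels, the binned expression values, and all function names are hypothetical, and the labels are assumed known here purely for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of a sequence."""
    n = len(values)
    return sum((k / n) * log2(n / k) for k in Counter(values).values())


def conditional_entropy(levels, labels):
    """H(X|C): group-size-weighted mean of the within-group entropies."""
    n = len(levels)
    groups = {}
    for x, c in zip(levels, labels):
        groups.setdefault(c, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def mutual_information(levels, labels):
    """I(X;C), computed directly from the empirical joint distribution."""
    n = len(levels)
    px, pc = Counter(levels), Counter(labels)
    joint = Counter(zip(levels, labels))
    return sum((k / n) * log2(k * n / (px[x] * pc[c]))
               for (x, c), k in joint.items())


# Hypothetical toy data: 8 cells in two known groups, expression already binned.
labels = ["A"] * 4 + ["B"] * 4
marker = [0, 0, 0, 0, 2, 2, 2, 2]  # expression follows the groups
noisy = [0, 2, 0, 2, 0, 2, 0, 2]   # same spread, unrelated to the groups

for name, gene in (("marker", marker), ("noisy", noisy)):
    total = entropy(gene)
    informative = mutual_information(gene, labels)
    residual = conditional_entropy(gene, labels)
    print(name, total, informative, residual)
```

Both toy genes have the same total entropy, yet only the first carries information about the groups. This is a loose analogue of the abstract's point that total fluctuation alone does not separate informative genes from noisy ones.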

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8a0/9175189/c9846580241e/mjac008fig1.jpg
