Category encoding method to select feature genes for the classification of bulk and single-cell RNA-seq data.

Affiliations

Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Institute of Statistical Sciences, College of Mathematics and Statistics, Shenzhen University, Shenzhen, China.

Department of Mathematics, Hong Kong University, Pokfulam, Hong Kong.

Publication information

Stat Med. 2021 Aug 15;40(18):4077-4089. doi: 10.1002/sim.9015. Epub 2021 May 24.

DOI:10.1002/sim.9015

PMID:34028849

Abstract

Bulk and single-cell RNA-seq (scRNA-seq) data are being used as alternatives to traditional technology in biology and medicine research. These data are used, for example, for the detection of differentially expressed (DE) genes. Several statistical methods have been developed for the classification of bulk and scRNA-seq data, and feature genes are vitally important for this classification. The majority of genes are not DE and are thus irrelevant for class distinction. To improve classification performance and save computation time, irrelevant genes need to be removed, which also aids the detection of the important feature genes. Widely used schemes in the literature, such as the BSS/WSS (BW) method, assume that data are normally distributed and may not be suitable for bulk and scRNA-seq data. In this article, a category encoding (CAEN) method is proposed to select feature genes for bulk and scRNA-seq data classification. This novel method encodes categories by employing the rank of sequence samples for each gene in each class. Correlation coefficients between gene and class are then computed from the rank of each sample and the new rank of its category. The genes with the highest correlation coefficients are taken as feature genes, which are the most effective for classifying bulk and scRNA-seq datasets. The sure screening method was also established for rank consistency properties of the proposed CAEN method. Simulation studies show that the classifier using the proposed CAEN method performs better than, or at least as well as, the existing methods in most settings. Existing real datasets were analyzed, with the results demonstrating superior performance of the proposed method over current competitors. The method has been implemented in an R package named "CAEN" to facilitate wide use.
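
The rank-based selection described in the abstract can be illustrated with a rough sketch. The abstract does not give the exact formulas, so the code below is one plausible reading and not the algorithm implemented in the "CAEN" R package: for each gene, samples are ranked by expression, each class is encoded by the rank of its mean sample rank, and the gene is scored by the absolute correlation between sample ranks and encoded categories. The function names (`rank_average`, `category_encoded_scores`, `select_feature_genes`) and the use of the Pearson correlation on ranks are assumptions made for illustration.

```python
import numpy as np


def rank_average(values):
    """Return ranks 1..n; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def category_encoded_scores(counts, labels):
    """Score each gene (row of counts) against the class labels.

    Assumed reading of the abstract, not the package's actual code:
    classes are re-encoded per gene by the rank of their mean sample rank.
    """
    counts = np.asarray(counts, dtype=float)  # shape: genes x samples
    labels = np.asarray(labels)
    classes = np.unique(labels)               # sorted unique class labels
    class_index = np.searchsorted(classes, labels)
    scores = np.zeros(counts.shape[0])
    for g in range(counts.shape[0]):
        sample_ranks = rank_average(counts[g])
        class_means = np.array(
            [sample_ranks[class_index == k].mean() for k in range(len(classes))]
        )
        encoded = rank_average(class_means)[class_index]
        # A constant gene or indistinguishable classes carry no signal.
        if sample_ranks.std() == 0 or encoded.std() == 0:
            continue
        scores[g] = abs(np.corrcoef(sample_ranks, encoded)[0, 1])
    return scores


def select_feature_genes(counts, labels, n_genes):
    """Return the row indices of the n_genes highest-scoring genes."""
    scores = category_encoded_scores(counts, labels)
    return np.argsort(-scores, kind="mergesort")[:n_genes]


# Toy data: 3 genes x 6 samples, two samples per class.
toy_counts = np.array([
    [1, 2, 10, 11, 50, 52],   # separates the three classes
    [5, 5, 5, 5, 5, 5],       # constant gene
    [3, 9, 4, 8, 5, 7],       # same mean rank in every class
])
toy_labels = np.array([0, 0, 1, 1, 2, 2])
toy_scores = category_encoded_scores(toy_counts, toy_labels)
toy_selected = select_feature_genes(toy_counts, toy_labels, n_genes=1)
```

Because only ranks enter the score, this sketch makes no normality assumption about the counts, which is the motivation the abstract gives for preferring a rank-based scheme over BW. Encoding the classes per gene also means the score does not depend on how the class labels happen to be numbered.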
