Department of Statistics, Begum Rokeya University, Rangpur 5400, Bangladesh.
Department of Bioinformatics and Public Health, Asian University for Women, Chittagong, Bangladesh.
Genomics. 2020 Mar;112(2):2000-2010. doi: 10.1016/j.ygeno.2019.11.012. Epub 2019 Nov 20.
Identification of differentially expressed genes (DEGs) under two or more experimental conditions is an important task for elucidating the molecular basis of phenotypic variation. In recent years, RNA sequencing (RNA-seq) based on next-generation sequencing has become an attractive and competitive alternative to microarrays, both because sequencing costs have fallen and because microarrays have well-known limitations. A number of methods have been developed for detecting DEGs from RNA-seq data. Most of these methods are based on either the Poisson distribution or the negative binomial (NB) distribution. However, identifying DEGs from read count data with such skewed distributions is inflexible and becomes complicated in the presence of outliers or extreme values.
Most existing DEG selection methods produce lower accuracy and more false discoveries in the presence of outliers. Some robust approaches, such as edgeR_robust and DEseq2, perform well in the presence of outliers when the sample size is large, but they perform poorly in the presence of outliers when the sample size is small. To address this issue, an alternative approach has emerged that transforms RNA-seq data into microarray-like data. Among the various transformation methods, voom combined with the limma pipeline has been shown to work better for RNA-seq data. However, limma with the voom transformation is sensitive to outliers in the small-sample case. Therefore, in this paper, we robustify the voom approach using the minimum β-divergence method. We evaluate the proposed method against seven popular biomarker selection methods (DEseq, DEseq2, SAMseq, Bayseq, limma (voom), edgeR and edgeR_robust) using both simulated and real datasets. Both types of experimental results show that the proposed method outperforms the competing methods in the presence of outliers, and performs almost as well as these methods in the absence of outliers.
In both the simulation study and the real RNA-seq count data analysis, the proposed method showed improved performance in the presence of outliers for both small- and large-sample cases. We therefore recommend using the proposed method in place of the existing methods for selecting DEGs.
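The two ingredients named in the abstract, the voom-style transformation and a minimum β-divergence estimate, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give their procedure, so the function names, the default `beta=0.2`, the median/MAD starting values, and the fixed iteration count are all assumptions made for illustration. The log-CPM formula uses the 0.5 and 1 offsets of the voom transformation but omits voom's precision weights, and the robust estimator is the usual iteratively reweighted form for a Gaussian model, in which each observation is weighted by exp(-β/2 · (x - μ)² / σ²).

```python
import math
import statistics


def log_cpm(counts, lib_size):
    """voom-style log2 counts per million for one sample.

    Uses the offsets 0.5 (count) and 1 (library size); the precision
    weights that voom also estimates are not computed here.
    """
    return [math.log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]


def beta_robust_mean_var(values, beta=0.2, n_iter=50):
    """Minimum beta-divergence style estimate of mean and variance.

    Illustrative sketch under a Gaussian model. Starting values
    (median and MAD-based scale) are an assumption of this example.
    Returns (mean, variance, weights); weights near 0 flag outliers.
    """
    mu = statistics.median(values)
    mad = statistics.median([abs(x - mu) for x in values])
    var = (1.4826 * mad) ** 2 if mad > 0 else 1.0

    weights = [1.0] * len(values)
    for _ in range(n_iter):
        # beta-weight: close to 1 near the centre, close to 0 for outliers
        weights = [math.exp(-0.5 * beta * (x - mu) ** 2 / var) for x in values]
        total = sum(weights)
        new_mu = sum(w * x for w, x in zip(weights, values)) / total
        # the factor (1 + beta) corrects the downward bias caused by weighting
        new_var = (1.0 + beta) * sum(
            w * (x - new_mu) ** 2 for w, x in zip(weights, values)
        ) / total
        if new_var <= 0:
            break
        mu, var = new_mu, new_var
    return mu, var, weights


if __name__ == "__main__":
    # Hypothetical log-expression values for one gene; the last is an outlier.
    expr = [5.0, 5.1, 4.9, 5.2, 4.8, 12.0]
    mu, var, w = beta_robust_mean_var(expr)
    print("ordinary mean:", sum(expr) / len(expr))
    print("robust mean:  ", mu)
    print("outlier weight:", w[-1])
```

With β = 0 every weight equals 1 and the estimates reduce to the ordinary mean and variance; a larger β downweights extreme values more aggressively, which is the behaviour the abstract relies on when samples are few.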