A comprehensive simulation study on classification of RNA-Seq data.

Authors

Zararsız Gökmen, Goksuluk Dincer, Korkmaz Selcuk, Eldem Vahap, Zararsiz Gozde Erturk, Duru Izzet Parug, Ozturk Ahmet

Affiliations

Turcosa Analytics Solutions Ltd Co, Erciyes Teknopark, 38039, Kayseri, Turkey.

Department of Biostatistics, Erciyes University, Kayseri, Turkey.

Publication information

PLoS One. 2017 Aug 23;12(8):e0182507. doi: 10.1371/journal.pone.0182507. eCollection 2017.

Abstract

RNA sequencing (RNA-Seq) is a powerful technique for gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging and powerful approach to diagnosis, disease classification and monitoring at the molecular level, and it can also provide potential markers of disease. Most statistical methods proposed for classifying gene-expression data either assume data on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be applied directly to RNA-Seq data, which violate both the data-structure and the distributional assumptions. However, these algorithms can be applied to RNA-Seq data with appropriate modifications. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data closer to microarray data and then apply microarray-based classifiers. In this study, we compared several classifiers, including PLDA with and without power transformation, NBLDA, single support vector machines (SVM), bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters, such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method, on model performance. A comprehensive simulation study was conducted, and its results were compared with those from two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size and the differential-expression rate, and decreasing the dispersion parameter and the number of groups, lead to an increase in classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful handling of overdispersion. We conclude that power-transformed PLDA, as a count-based classifier, and vst- or rlog-transformed RF and SVM, as microarray-based classifiers, may be good choices for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
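To make the idea of a count-based classifier more concrete, the following is a minimal sketch of a Poisson linear discriminant rule with an optional power transformation. It is not the authors' implementation and not the MLSeq API (MLSeq is an R package). It is a simplified illustration that omits the shrinkage of class effects used in the full PLDA method. The function names, the smoothing constant beta, the fixed exponent alpha, and the toy counts are all assumptions made for this example; in practice the exponent would be chosen from the data rather than fixed.

```python
import math


def fit_plda(counts, labels, beta=1.0):
    """Fit a simplified Poisson LDA (no shrinkage of the class effects).

    counts: list of samples, each a list of non-negative gene counts.
    labels: one class label per sample.
    beta:   hypothetical smoothing constant that keeps d_kj strictly positive.
    """
    n_genes = len(counts[0])
    total = sum(sum(row) for row in counts)
    # Gene totals g_j and size factors s_i = (sample total) / (grand total).
    gene_totals = [sum(row[j] for row in counts) for j in range(n_genes)]
    size = [sum(row) / total for row in counts]
    classes = {}
    for k in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == k]
        s_k = sum(size[i] for i in idx)
        # d_kj: smoothed ratio of observed to expected counts of gene j in class k.
        d = [
            (sum(counts[i][j] for i in idx) + beta) / (s_k * gene_totals[j] + beta)
            for j in range(n_genes)
        ]
        classes[k] = {"d": d, "log_prior": math.log(len(idx) / len(labels))}
    return {"classes": classes, "gene_totals": gene_totals, "total": total}


def predict_plda(fit, sample):
    """Return the class with the largest Poisson log-likelihood score."""
    s_new = sum(sample) / fit["total"]  # size factor of the new sample
    best, best_score = None, -math.inf
    for k, par in fit["classes"].items():
        score = par["log_prior"]
        for x, g, d in zip(sample, fit["gene_totals"], par["d"]):
            # Poisson log-likelihood terms that depend on the class.
            score += x * math.log(d) - s_new * g * d
        if score > best_score:
            best, best_score = k, score
    return best


def power_transform(counts, alpha=0.5):
    """Raise counts to a power in (0, 1] to reduce overdispersion.

    alpha is fixed here purely for illustration.
    """
    return [[x ** alpha for x in row] for row in counts]


# Toy counts invented for illustration: 4 samples x 4 genes, two classes.
train = [
    [30, 5, 10, 8],
    [28, 6, 12, 9],
    [6, 29, 11, 7],
    [5, 31, 9, 10],
]
labels = ["A", "A", "B", "B"]

fit_raw = fit_plda(train, labels)
fit_pt = fit_plda(power_transform(train), labels)

new_sample = [25, 4, 9, 7]
print(predict_plda(fit_raw, new_sample))
print(predict_plda(fit_pt, power_transform([new_sample])[0]))
```

The same fitting and prediction code is reused for raw and power-transformed counts, which mirrors the comparison of PLDA with and without power transformation described above. The alternative route of vst or rlog transformation followed by RF or SVM is not shown here.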


