Knight Jason M, Ivanov Ivan, Dougherty Edward R
Department of Electrical Engineering, Texas A&M University, 3128 TAMU, College Station, TX 77843, USA.
Department of Veterinary Physiology and Pharmacology, Texas A&M University, 3128 TAMU, College Station, TX 77843, USA.
BMC Bioinformatics. 2014 Dec 10;15(1):401. doi: 10.1186/s12859-014-0401-3.
Sequencing datasets consist of a finite number of reads which map to specific regions of a reference genome. Most effort in modeling these datasets focuses on the detection of univariate differentially expressed genes. However, for classification, we must consider multiple genes and their interactions.
Thus, we introduce a hierarchical multivariate Poisson (MP) model and the associated optimal Bayesian classifier (OBC) for classifying samples using sequencing data. Because the classifier has no closed-form solution, we employ a Markov chain Monte Carlo (MCMC) approach to perform classification. On two synthetic datasets, and over a range of classification problem difficulties, we demonstrate classification performance superior or equivalent to that of typical classifiers. We also introduce the Bayesian minimum mean squared error (MMSE) conditional error estimator and demonstrate its computation over the feature space. In addition, we demonstrate superior or leading-class performance on an RNA-Seq dataset containing two lung cancer tumor types from The Cancer Genome Atlas (TCGA).
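The idea of classifying count data under a hierarchical Poisson model can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes two genes whose log-rates follow a bivariate normal distribution and whose counts are Poisson with rate depth * exp(log-rate); that structure, the function names, and all parameter values are illustrative assumptions. It also treats the class parameters as known and integrates out the log-rates by plain Monte Carlo, so it is a plug-in likelihood classifier with equal class priors, not the OBC (which averages over parameter uncertainty) and not an MCMC sampler.

```python
import math
import random


def draw_log_rates(rng, mean, sigma, rho):
    """Draw correlated log-rates for two genes (assumed bivariate normal)."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mean[0] + sigma * z1,
            mean[1] + sigma * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))


def draw_poisson(rng, rate):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_counts(rng, depth, mean, sigma, rho):
    """Simulate read counts for one sample: Poisson(depth * exp(log-rate))."""
    log_rates = draw_log_rates(rng, mean, sigma, rho)
    return tuple(draw_poisson(rng, depth * math.exp(v)) for v in log_rates)


def log_poisson(x, rate):
    """Log of the Poisson probability mass function."""
    return x * math.log(rate) - rate - math.lgamma(x + 1)


def mc_log_likelihood(rng, counts, depth, mean, sigma, rho, n_draws=500):
    """Plain Monte Carlo estimate of log p(counts | class parameters)."""
    logs = []
    for _ in range(n_draws):
        log_rates = draw_log_rates(rng, mean, sigma, rho)
        logs.append(sum(log_poisson(x, depth * math.exp(v))
                        for x, v in zip(counts, log_rates)))
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs) / n_draws)


def classify(rng, counts, depth, classes):
    """Return the index of the class with the larger estimated likelihood."""
    scores = [mc_log_likelihood(rng, counts, depth, **params)
              for params in classes]
    return scores.index(max(scores))


# Hypothetical class parameters, chosen only to make the classes separable.
CLASSES = [
    {"mean": (1.0, 1.0), "sigma": 0.3, "rho": 0.5},
    {"mean": (3.0, 3.0), "sigma": 0.3, "rho": 0.5},
]

if __name__ == "__main__":
    rng = random.Random(0)
    sample = sample_counts(rng, depth=1.0, **CLASSES[1])
    print(sample, "->", classify(rng, sample, 1.0, CLASSES))
```

The log-sum-exp step keeps the likelihood average stable when individual draws have very small probabilities. A fuller treatment would place priors on the class parameters and sample their posterior, which is where an MCMC sampler would come in.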
Through model-based, optimal Bayesian classification, we demonstrate superior classification performance for both synthetic and real RNA-Seq datasets. A tutorial video and Python source code are available under an open source license at http://bit.ly/1gimnss.