Kim Eunji, Ivanov Ivan, Hua Jianping, Lampe Johanna W, Hullar Meredith AJ, Chapkin Robert S, Dougherty Edward R
Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX, USA.
Department of Veterinary Physiology & Pharmacology, Texas A&M University, College Station, TX, USA.
Cancer Inform. 2017 Jun 12;16:1176935117710530. doi: 10.1177/1176935117710530. eCollection 2017.
Ranking feature sets for phenotype classification based on gene expression is a challenging issue in cancer bioinformatics. When the number of samples is small, all feature selection algorithms are known to be unreliable, producing significant error, and error estimators suffer from different degrees of imprecision. The problem is compounded by the fact that the accuracy of classification depends on the manner in which the phenomena are transformed into data by the measurement technology. Because next-generation sequencing technologies amount to a nonlinear transformation of the actual gene or RNA concentrations, they can potentially produce less discriminative data relative to the actual gene expression levels. In this study, we compare the performance of ranking feature sets derived from a model of RNA-Seq data with that of a multivariate normal model of gene concentrations using 3 measures: (1) ranking power, (2) length of extensions, and (3) Bayes features. This is a model-based study that examines the effectiveness of reporting lists of small feature sets using RNA-Seq data, as well as the effects of different model parameters and error estimators. The results demonstrate that the general trends of the parameter effects on the ranking power of the underlying gene concentrations are preserved in the RNA-Seq data, whereas the power of finding a good feature set becomes weaker when gene concentrations are transformed by the sequencing machine.
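The comparison described above can be illustrated with a minimal simulation sketch. It is not the authors' model or code. It assumes that log-concentrations are independent Gaussians (a diagonal special case of a multivariate normal), that read counts are Poisson with mean `depth * exp(concentration)`, and that features are ranked by the resubstitution error of a one-dimensional nearest-mean classifier. The function names, the parameter values, and the "markers recovered in the top k" score are hypothetical stand-ins, and the score is only a crude proxy for the ranking power measure named in the abstract.

```python
import numpy as np


def simulate(n_per_class, rng, n_features=20, n_markers=4,
             delta=1.0, sigma=0.6, depth=20.0):
    """Draw Gaussian log-concentrations and Poisson read counts for two classes.

    All parameter values are illustrative assumptions, not taken from the study.
    """
    labels = np.repeat([0, 1], n_per_class)
    means = np.zeros((2, n_features))
    means[1, :n_markers] = delta  # only the first n_markers genes differ by class
    conc = means[labels] + sigma * rng.standard_normal((labels.size, n_features))
    # Assumed sequencing model: a nonlinear (exponential) map plus Poisson noise.
    counts = rng.poisson(depth * np.exp(conc))
    return conc, counts, labels


def per_feature_error(train_x, train_y, test_x, test_y):
    """Error of a 1-D nearest-class-mean classifier, one value per feature."""
    m0 = train_x[train_y == 0].mean(axis=0)
    m1 = train_x[train_y == 1].mean(axis=0)
    pred = (np.abs(test_x - m1) < np.abs(test_x - m0)).astype(int)
    return (pred != test_y[:, None]).mean(axis=0)


def markers_in_top_k(errors, n_markers, k):
    """Count how many true marker genes appear among the k lowest-error features."""
    top = np.argsort(errors, kind="stable")[:k]
    return int(np.sum(top < n_markers))


rng = np.random.default_rng(0)
n_markers, k, trials = 4, 4, 200
hits = {"concentration": 0, "counts": 0}
for _ in range(trials):
    # Small-sample setting: 10 samples per class, ranked by resubstitution error.
    conc, counts, y = simulate(10, rng, n_markers=n_markers)
    hits["concentration"] += markers_in_top_k(
        per_feature_error(conc, y, conc, y), n_markers, k)
    hits["counts"] += markers_in_top_k(
        per_feature_error(counts, y, counts, y), n_markers, k)

for name, h in hits.items():
    print(f"{name}: mean markers recovered in top {k} = {h / trials:.2f}")
```

Varying `depth`, `sigma`, or the sample size in this sketch gives a rough way to explore how such parameters might affect rankings built on counts versus concentrations. Any such exploration reflects only the assumptions listed above.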