The Model-Based Study of the Effectiveness of Reporting Lists of Small Feature Sets Using RNA-Seq Data

Author information

Kim Eunji, Ivanov Ivan, Hua Jianping, Lampe Johanna W, Hullar Meredith AJ, Chapkin Robert S, Dougherty Edward R

Affiliations

Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX, USA.

Department of Veterinary Physiology & Pharmacology, Texas A&M University, College Station, TX, USA.

Publication information

Cancer Inform. 2017 Jun 12;16:1176935117710530. doi: 10.1177/1176935117710530. eCollection 2017.

DOI: 10.1177/1176935117710530

PMID: 28659712

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5470876/

Abstract

Ranking feature sets for phenotype classification based on gene expression is a challenging issue in cancer bioinformatics. When the number of samples is small, all feature selection algorithms are known to be unreliable, producing significant error, and error estimators suffer from different degrees of imprecision. The problem is compounded by the fact that the accuracy of classification depends on the manner in which the phenomena are transformed into data by the measurement technology. Because next-generation sequencing technologies amount to a nonlinear transformation of the actual gene or RNA concentrations, they can potentially produce less discriminative data relative to the actual gene expression levels. In this study, we compare the performance of ranking feature sets derived from a model of RNA-Seq data with that of a multivariate normal model of gene concentrations using 3 measures: (1) ranking power, (2) length of extensions, and (3) Bayes features. This is a model-based study that examines the effectiveness of reporting lists of small feature sets using RNA-Seq data, as well as the effects of different model parameters and error estimators. The results demonstrate that the general trends of the parameter effects on the ranking power of the underlying gene concentrations are preserved in the RNA-Seq data, whereas the power of finding a good feature set becomes weaker when gene concentrations are transformed by the sequencing machine.
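The comparison described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' model or code, and every function name and parameter value in it is hypothetical. It assumes independent Gaussian log-concentrations (a diagonal special case of a multivariate normal model), a Poisson read-count model whose mean is a sequencing depth times the exponentiated log-concentration (an assumption not stated in the abstract), feature sets of size one, and ranking by the resubstitution error of a nearest-mean classifier. It counts how many truly discriminative features appear at the top of the ranking before and after the count transformation.

```python
import math
import random


def poisson(rng, mean):
    """Draw one Poisson variate (Knuth's method; normal approximation for large means)."""
    if mean > 500:
        return max(0, round(rng.gauss(mean, math.sqrt(mean))))
    limit, k, p = math.exp(-mean), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate(rng, n_per_class, n_features, n_markers,
             shift=1.0, sigma=0.5, depth=5.0):
    """Return (log-concentrations, counts, labels).

    Only the first n_markers features have class-dependent means.
    The count model Poisson(depth * exp(log-concentration)) is an assumption.
    """
    conc, counts, labels = [], [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            lam = [rng.gauss(shift * label if j < n_markers else 0.0, sigma)
                   for j in range(n_features)]
            conc.append(lam)
            counts.append([poisson(rng, depth * math.exp(v)) for v in lam])
            labels.append(label)
    return conc, counts, labels


def resub_error(values, labels):
    """Resubstitution error of a one-feature nearest-mean classifier."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    m0, m1 = sum(class0) / len(class0), sum(class1) / len(class1)
    errors = sum((abs(v - m1) < abs(v - m0)) != bool(y)
                 for v, y in zip(values, labels))
    return errors / len(values)


def rank_features(data, labels):
    """Feature indices sorted by increasing estimated error (ties keep index order)."""
    n_features = len(data[0])
    errors = [resub_error([row[j] for row in data], labels)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: errors[j])


def top_k_hits(data, labels, n_markers):
    """Number of true marker features among the n_markers top-ranked features."""
    ranking = rank_features(data, labels)
    return len(set(ranking[:n_markers]) & set(range(n_markers)))


if __name__ == "__main__":
    rng = random.Random(0)
    trials, n_markers = 200, 3
    hits_conc = hits_counts = 0
    for _ in range(trials):
        conc, counts, labels = simulate(rng, n_per_class=10,
                                        n_features=50, n_markers=n_markers)
        hits_conc += top_k_hits(conc, labels, n_markers)
        hits_counts += top_k_hits(counts, labels, n_markers)
    print("mean markers recovered from concentrations:", hits_conc / trials)
    print("mean markers recovered from counts:", hits_counts / trials)
```

Lowering `depth` or `n_per_class` in this toy setup is one way to explore how the count transformation and small samples affect the ranking. The measures used in the study itself (ranking power, length of extensions, Bayes features) are defined in the full text and are not reproduced here.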

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a6e4/5470876/f50338f0a4a1/10.1177_1176935117710530-fig1.jpg
