Effect of normalization methods on the performance of supervised learning algorithms applied to HTSeq-FPKM-UQ data sets: 7SK RNA expression as a predictor of survival in patients with colon adenocarcinoma.

Affiliation

Mathematical Biosciences Institute, Ohio State University, Columbus, Ohio, USA.

Publication information

Brief Bioinform. 2019 May 21;20(3):985-994. doi: 10.1093/bib/bbx153.

Abstract

MOTIVATION

One of the main challenges in machine learning (ML) is choosing an appropriate normalization method. Here, we examine the effect of various normalization methods on the analysis of FPKM upper quartile (FPKM-UQ) RNA sequencing data sets. We collect the HTSeq-FPKM-UQ files of patients with colon adenocarcinoma from the TCGA-COAD project. We compare the three most common normalization methods (scaling, standardizing using z-score and vector normalization) by visualizing the normalized data set and evaluating the performance of 12 supervised learning algorithms on the normalized data set. Additionally, for each of these normalization methods, we use two different normalization strategies: normalizing samples (files) or normalizing features (genes).
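The three methods and two strategies described above give six normalization variants. The sketch below is one possible reading of them in plain Python, not the authors' implementation. It assumes that "scaling" means min-max scaling to [0, 1], that z-scores use the population standard deviation, and that "vector normalization" means dividing by the Euclidean norm. The function names and the toy matrix are hypothetical.

```python
import math

def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]

def min_max_scale(vec):
    """Assumed meaning of 'scaling': map values linearly onto [0, 1]."""
    lo, hi = min(vec), max(vec)
    span = hi - lo
    return [0.0 if span == 0 else (x - lo) / span for x in vec]

def z_score(vec):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = sum(vec) / len(vec)
    std = math.sqrt(sum((x - mean) ** 2 for x in vec) / len(vec))
    return [0.0 if std == 0 else (x - mean) / std for x in vec]

def unit_length(vec):
    """Vector normalization: divide by the Euclidean (L2) norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [0.0 if norm == 0 else x / norm for x in vec]

def normalize(matrix, method, by="samples"):
    """Apply `method` per row (samples/files) or per column (features/genes).

    Rows are assumed to be samples and columns to be genes.
    """
    if by == "samples":
        return [method(row) for row in matrix]
    if by == "features":
        return transpose([method(col) for col in transpose(matrix)])
    raise ValueError("by must be 'samples' or 'features'")

# Toy expression matrix (made-up values): 2 samples x 3 genes.
expr = [[1.0, 2.0, 3.0],
        [4.0, 0.0, 8.0]]

# The six variants: 3 methods x 2 strategies.
variants = {
    (method.__name__, by): normalize(expr, method, by)
    for method in (min_max_scale, z_score, unit_length)
    for by in ("samples", "features")
}
```

Constant vectors are mapped to zeros here to avoid division by zero; that is a design choice of this sketch, not something stated in the abstract.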

RESULTS

Regardless of the normalization method, a support vector machine (SVM) model with the radial basis function kernel had the maximum accuracy (78%) in predicting the vital status of the patients. However, the fitting time of the SVM depended on the normalization method, and it reached its minimum when files were normalized to unit length. Furthermore, among all 12 learning algorithms and 6 different normalization techniques, the Bernoulli naive Bayes model after standardizing files had the best performance in terms of maximizing the accuracy as well as minimizing the fitting time. We also investigated the effect of dimensionality reduction methods on the performance of the supervised ML algorithms. Reducing the dimension of the data set did not increase the maximum accuracy of 78%. However, it led to the discovery of 7SK RNA gene expression as a predictor of survival in patients with colon adenocarcinoma, with an accuracy of 78%.
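The results compare models on two quantities, accuracy and fitting time. A minimal sketch of how both could be measured for any model with `fit` and `predict` methods is given below. The `MajorityClassifier` baseline, the `evaluate` helper, and the toy data are all hypothetical and stand in for the 12 algorithms and the TCGA-COAD data, which are not reproduced here.

```python
import time

class MajorityClassifier:
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        self.label_ = max(counts, key=counts.get)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def evaluate(model, X_train, y_train, X_test, y_test):
    """Return (accuracy on the test set, wall-clock fitting time in seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    predictions = model.predict(X_test)
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return correct / len(y_test), fit_seconds

# Toy data (made-up values): one normalized feature per sample, vital status as label.
X_train = [[0.10], [0.40], [0.35], [0.80]]
y_train = ["Alive", "Alive", "Alive", "Dead"]
X_test = [[0.20], [0.90]]
y_test = ["Alive", "Dead"]

accuracy, fit_seconds = evaluate(MajorityClassifier(), X_train, y_train, X_test, y_test)
```

Running the same `evaluate` call for each model and each normalized version of the data would yield an accuracy and fitting-time table of the kind the abstract summarizes. A majority-class baseline is included because it shows what accuracy is reachable without using any expression values.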

