How does normalization impact RNA-seq disease diagnosis?

Affiliations

Department of Computer and Information Science, Fordham University, Lincoln Center, New York, NY 10023, USA.

Department of Public Health, Xi'an Medical University, Xi'an 710021, China.

Publication information

J Biomed Inform. 2018 Sep;85:80-92. doi: 10.1016/j.jbi.2018.07.016. Epub 2018 Jul 21.

DOI: 10.1016/j.jbi.2018.07.016
PMID: 30041017

Abstract

With the surge of next-generation high-throughput technologies, RNA-seq data is playing an increasingly important role in disease diagnosis, where normalization is assumed to be an essential step for producing comparable samples. Recent studies have proposed various normalization methods to remove technical biases in RNA sequencing. However, no previous study has evaluated the impact of normalization on RNA-seq disease diagnosis. In this study, we investigate this problem by analyzing structured big data: RNA-seq data acquired from the TCGA portal, chosen for its popularity in RNA-seq disease diagnosis. We propose a novel normalization effect test algorithm, diagnostic index (d-index), and data entropy to analyze and evaluate the impact of normalization on RNA-seq disease diagnosis using state-of-the-art machine learning models. Furthermore, we present an original visualization analysis to compare the performance of normalized data with that of raw data. We found that normalized data generally yields diagnostic performance equivalent to, or even lower than, that of the corresponding raw data. Moreover, some normalization approaches (e.g., RPKM) even have negative effects on disease diagnosis. Raw data, on the other hand, appears able to decipher pathological status better than, or at least comparably to, normalized data. Our visualization analysis also shows that some normalization methods introduce 'outliers', which unavoidably decrease sample detectability in diagnosis. More importantly, our data entropy analysis shows that normalized data usually has entropy values equivalent to or lower than those of raw data, and data with high entropy values tends to achieve better diagnosis than data with low entropy values. In addition, we found that high-dimensional imbalance (HDI) data is unaffected by any normalization procedure in diagnosis: almost all machine learning models fail on it, recognizing only the majority types regardless of whether the data is raw or normalized.
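The failure mode described for imbalanced data, a model that recognizes only the majority type, can be illustrated with a minimal sketch. The labels and class sizes below are hypothetical and are not taken from the study; the point is only that plain accuracy can look high while the minority class is never detected, which is why a per-class measure is more informative here.

```python
from collections import Counter


def majority_class_predictions(train_labels, n_test):
    """Predict the most frequent training label for every test sample."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return [majority_label] * n_test


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall computed separately for each true label."""
    recalls = {}
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls[label] = sum(y_pred[i] == label for i in idx) / len(idx)
    return recalls


# Hypothetical, heavily imbalanced labels (illustrative only).
train_labels = ["tumor"] * 95 + ["normal"] * 5
test_labels = ["tumor"] * 19 + ["normal"] * 1

pred = majority_class_predictions(train_labels, len(test_labels))
acc = accuracy(test_labels, pred)
recalls = per_class_recall(test_labels, pred)
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = sum(recalls.values()) / len(recalls)

print(acc)                # 0.95
print(recalls)            # {'normal': 0.0, 'tumor': 1.0}
print(balanced_accuracy)  # 0.5
```

Because the predictor ignores the expression values entirely, its output is identical for raw and normalized inputs, which mirrors the observation that normalization does not change the outcome on such data.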
Our results suggest that normalized data may not offer statistically significant advantages over raw data in disease diagnosis. This further implies that normalization, or at least some normalization procedures, may not be indispensable in RNA-seq disease diagnosis. Instead, raw data may perform better by capturing more of the original transcriptome patterns under different pathological conditions.
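For readers unfamiliar with RPKM, the normalization singled out in the abstract, the following minimal sketch applies the standard definition (reads per kilobase of transcript per million mapped reads) to a single sample. The function name, counts, and gene lengths are illustrative assumptions, not the pipeline used in the study.

```python
def rpkm(counts, gene_lengths_bp):
    """Standard RPKM for one sample.

    RPKM_g = count_g * 1e9 / (length_g_in_bp * total_mapped_reads)
    """
    if len(counts) != len(gene_lengths_bp):
        raise ValueError("counts and gene_lengths_bp must have the same length")
    total_reads = sum(counts)
    if total_reads == 0:
        raise ValueError("sample has no mapped reads")
    return [
        count * 1e9 / (length * total_reads)
        for count, length in zip(counts, gene_lengths_bp)
    ]


# Hypothetical raw read counts and gene lengths for three genes.
raw_counts = [100, 200, 700]
gene_lengths_bp = [1000, 2000, 3500]

example_rpkm = rpkm(raw_counts, gene_lengths_bp)
print(example_rpkm)  # [100000.0, 100000.0, 200000.0]
```

In this toy sample the first two genes have different raw counts but identical RPKM values, which shows how length scaling can change the relative structure of the data that a downstream classifier sees.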

Similar articles

1
How does normalization impact RNA-seq disease diagnosis?
J Biomed Inform. 2018 Sep;85:80-92. doi: 10.1016/j.jbi.2018.07.016. Epub 2018 Jul 21.
2
Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions.
Brief Bioinform. 2018 Sep 28;19(5):776-792. doi: 10.1093/bib/bbx008.
3
Removing technical variability in RNA-seq data using conditional quantile normalization.
Biostatistics. 2012 Apr;13(2):204-16. doi: 10.1093/biostatistics/kxr054. Epub 2012 Jan 27.
4
ChimeRScope: a novel alignment-free algorithm for fusion transcript prediction using paired-end RNA-Seq data.
Nucleic Acids Res. 2017 Jul 27;45(13):e120. doi: 10.1093/nar/gkx315.
5
Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data.
BMC Bioinformatics. 2015 Oct 28;16:347. doi: 10.1186/s12859-015-0778-7.
6
mRNA enrichment protocols determine the quantification characteristics of external RNA spike-in controls in RNA-Seq studies.
Sci China Life Sci. 2013 Feb;56(2):134-42. doi: 10.1007/s11427-013-4437-9. Epub 2013 Feb 8.
7
DETECTION OF BACTERIAL SMALL TRANSCRIPTS FROM RNA-SEQ DATA: A COMPARATIVE ASSESSMENT.
Pac Symp Biocomput. 2016;21:456-67.
8
Isoform abundance inference provides a more accurate estimation of gene expression levels in RNA-seq.
J Bioinform Comput Biol. 2010 Dec;8 Suppl 1:177-92. doi: 10.1142/s0219720010005178.
9
Diagnostic biases in translational bioinformatics.
BMC Med Genomics. 2015 Aug 1;8:46. doi: 10.1186/s12920-015-0116-y.
10
Comparative study of de novo assembly and genome-guided assembly strategies for transcriptome reconstruction based on RNA-Seq.
Sci China Life Sci. 2013 Feb;56(2):143-55. doi: 10.1007/s11427-013-4442-z. Epub 2013 Feb 8.

Cited by

1
Explainable Machine Learning Models for Glioma Subtype Classification and Survival Prediction.
Cancers (Basel). 2025 Aug 9;17(16):2614. doi: 10.3390/cancers17162614.
2
MNMO: discover driver genes from a multi-omics data based-multi-layer network.
Bioinformatics. 2025 Mar 29;41(4). doi: 10.1093/bioinformatics/btaf134.
3
Variability in donor leukocyte counts confound the use of common RNA sequencing data normalization strategies in transcriptomic biomarker studies performed with whole blood.
Sci Rep. 2023 Sep 19;13(1):15514. doi: 10.1038/s41598-023-41443-4.
4
Transforming RNA-Seq gene expression to track cancer progression in the multi-stage early to advanced-stage cancer development.
PLoS One. 2023 Apr 24;18(4):e0284458. doi: 10.1371/journal.pone.0284458. eCollection 2023.
5
An artificial neural network model to predict the mortality of COVID-19 patients using routine blood samples at the time of hospital admission: Development and validation study.
Medicine (Baltimore). 2021 Jul 16;100(28):e26532. doi: 10.1097/MD.0000000000026532.
6
Similarities and dissimilarities between psychiatric cluster disorders.
Mol Psychiatry. 2021 Sep;26(9):4853-4863. doi: 10.1038/s41380-021-01030-3. Epub 2021 Jan 27.
7
Interpretable Log Contrasts for the Classification of Health Biomarkers: a New Approach to Balance Selection.
mSystems. 2020 Apr 7;5(2):e00230-19. doi: 10.1128/mSystems.00230-19.