A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.

Affiliations

School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA.

Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.

Publication information

BMC Bioinformatics. 2024 May 8;25(1):181. doi: 10.1186/s12859-024-05801-x.

DOI:10.1186/s12859-024-05801-x
PMID:38720247
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11080237/
Abstract

BACKGROUND

RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, which reflect the disease class, can be constructed for known tissue types using gene expression measurements extracted from cancer patients. One challenge is that current cancer predictors often perform suboptimally when they integrate molecular datasets generated by different labs. The data often vary in quality, are procured differently, and contain unwanted noise that hampers a predictive model's ability to extract useful information. Data preprocessing methods can be applied to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.

RESULTS

We aimed to investigate the impact of data preprocessing steps (normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the classifier models constructed for tissue of origin predictions in cancer.
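
To make the three preprocessing steps named above concrete, here is a minimal Python sketch. It is not the authors' pipeline: the abstract does not say which normalization, batch correction, or scaling methods were compared, so log-CPM normalization, per-batch mean centering (a crude stand-in for dedicated batch correction methods), and z-score scaling are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def log_cpm(counts):
    """Normalize one sample to counts per million, then take log2(x + 1).
    Assumes the sample has a nonzero total count."""
    total = sum(counts)
    return [math.log2(c / total * 1e6 + 1.0) for c in counts]

def center_by_batch(samples, batches):
    """Illustrative batch correction: subtract each batch's per-gene mean.
    This is an assumption for the sketch, not a published method."""
    groups = defaultdict(list)
    for row, batch in zip(samples, batches):
        groups[batch].append(row)
    means = {b: [sum(col) / len(rows) for col in zip(*rows)]
             for b, rows in groups.items()}
    return [[x - m for x, m in zip(row, means[b])]
            for row, b in zip(samples, batches)]

def fit_scaler(train):
    """Per-gene mean and standard deviation, estimated on training data only."""
    n = len(train)
    means = [sum(col) / n for col in zip(*train)]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0  # constant gene: divide by 1
            for col, m in zip(zip(*train), means)]
    return means, stds

def apply_scaler(samples, means, stds):
    """Scale any dataset with statistics learned from the training set."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in samples]

# Toy data: rows are samples, columns are genes (made-up numbers).
train = [log_cpm(r) for r in [[10, 90, 0], [20, 70, 10]]]
test = [log_cpm(r) for r in [[5, 40, 5]]]
means, stds = fit_scaler(train)
train_scaled = apply_scaler(train, means, stds)
test_scaled = apply_scaler(test, means, stds)
```

Fitting the scaler on the training set and reusing its statistics on the test set keeps the independent study from leaking into model construction, which matters for the cross-study setting the abstract describes.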

CONCLUSION

By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
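
The weighted F1-score used as the performance measure above can be sketched in a few lines of plain Python. This is a generic illustration of the metric, not code from the study: each class's F1 is averaged with a weight equal to that class's share of the true labels, which should correspond to what scikit-learn computes with `f1_score(..., average="weighted")`. The tissue labels in the example are made up.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total
    return score

# Hypothetical tissue-of-origin labels for four test samples.
y_true = ["lung", "lung", "breast", "colon"]
y_pred = ["lung", "breast", "breast", "colon"]
print(weighted_f1(y_true, y_pred))  # approximately 0.75
```

Because the weights follow class frequency in the test set, this metric is dominated by the common tumor types, so a change in score between test sets can partly reflect a different class mix rather than a different model.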

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ea8/11080237/3694df1518a8/12859_2024_5801_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ea8/11080237/66cd68341a80/12859_2024_5801_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ea8/11080237/a7e42a7e7099/12859_2024_5801_Fig2_HTML.jpg
