A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.

Affiliations

School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA.

Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.

Publication information

BMC Bioinformatics. 2024 May 8;25(1):181. doi: 10.1186/s12859-024-05801-x.

Abstract

BACKGROUND

RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the data vary in quality, are procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in an attempt to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.

RESULTS

We aimed to investigate the impact of data preprocessing steps, focusing on normalization, batch effect correction, and data scaling, through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer.
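
The three categories of preprocessing named above can be illustrated with a minimal sketch. The abstract does not state which specific methods were compared, so the choices here are assumptions made purely for illustration: log2 counts-per-million for normalization, per-batch mean centering as a crude stand-in for batch effect correction, and min-max scaling. The function names, the toy count matrix, and the batch labels are all hypothetical.

```python
import math

def log_cpm(counts):
    """Normalize one sample: counts per million, then log2(x + 1)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [math.log2(c / total * 1e6 + 1) for c in counts]

def center_by_batch(samples, batches):
    """Subtract each batch's per-gene mean (a simplified stand-in for batch correction)."""
    n_genes = len(samples[0])
    means = {}
    for b in set(batches):
        rows = [s for s, sb in zip(samples, batches) if sb == b]
        means[b] = [sum(r[g] for r in rows) / len(rows) for g in range(n_genes)]
    return [[v - m for v, m in zip(s, means[b])] for s, b in zip(samples, batches)]

def min_max_scale(samples):
    """Rescale each gene to [0, 1] across samples; constant genes map to 0."""
    cols = list(zip(*samples))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(s, lo, hi)]
            for s in samples]

# Toy data: 4 samples x 3 genes, made up for illustration (not from the study).
raw = [[10, 90, 0], [20, 70, 10], [300, 600, 100], [500, 400, 100]]
batches = ["study_a", "study_a", "study_b", "study_b"]

normalized = [log_cpm(s) for s in raw]          # step 1: normalization
corrected = center_by_batch(normalized, batches)  # step 2: batch effect correction
scaled = min_max_scale(corrected)               # step 3: data scaling
```

Note that per-batch centering of this kind can also remove real biological differences when batch and tissue type are confounded, which is one way a preprocessing step might hurt rather than help a downstream classifier.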

CONCLUSION

By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
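
For readers unfamiliar with the metric, the weighted F1-score is commonly defined as the per-class F1 averaged with weights proportional to each class's number of true samples. The sketch below implements that standard definition in plain Python; the function name and the toy tissue labels are hypothetical and are not taken from the study.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)

# Toy labels, made up for illustration (not data from the study).
y_true = ["lung", "lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "lung", "breast", "breast", "breast", "lung"]
score = weighted_f1(y_true, y_pred)
```

Because each class is weighted by its support, frequent tumor types influence the score more than rare ones, which matters when the test set's class balance differs from the training set's.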

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ea8/11080237/66cd68341a80/12859_2024_5801_Fig1_HTML.jpg
