On data normalization and batch-effect correction for tumor subtyping with microRNA data

Author information

Wu Yilin, Yuen Becky Wing-Yan, Wei Yingying, Qin Li-Xuan

Affiliations

Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA.

Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, SAR, China.

Publication information

NAR Genom Bioinform. 2023 Jan 10;5(1):lqac100. doi: 10.1093/nargab/lqac100. eCollection 2023 Mar.

DOI:10.1093/nargab/lqac100

PMID:36632610

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9830544/

Abstract

The discovery of new tumor subtypes has been aided by transcriptomics profiling. However, some new subtypes can be irreproducible because of data artifacts that arise from disparate experimental handling. To deal with these artifacts, methods for data normalization and batch-effect correction have been applied before sample clustering for disease subtyping, even though these methods were developed primarily for group comparison. Whether they are effective for sample clustering remains to be elucidated. We examined this issue with a re-sampling-based simulation study that leverages a pair of microRNA microarray data sets. Our study showed that (i) normalization generally benefited the discovery of sample clusters, and quantile normalization tended to be the best performer; (ii) batch-effect correction was harmful when data artifacts were confounded with biological signals; and (iii) the performance of both can be influenced by the choice of clustering method, with the Partitioning Around Medoids (PAM) method based on Pearson correlation consistently among the best performers. Our study provides important insights into the use of data normalization and batch-effect correction, in connection with the design of array-to-sample assignment and the choice of clustering method, for facilitating accurate and reproducible discovery of tumor subtypes with microRNAs.

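As an illustration of two ingredients named in the abstract, the sketch below shows a basic quantile normalization and a Pearson-correlation-based dissimilarity between samples, the kind of distance a medoid-based clustering method could consume. It is not the authors' code. The function names, the toy data, and the simulated handling artifact are all hypothetical, and ties are broken by order rather than averaged, which is a simplifying assumption.

```python
import numpy as np


def quantile_normalize(x):
    """Force every column (sample) of a features-by-samples array to share
    the same empirical distribution. Ties are broken by position (assumption)."""
    x = np.asarray(x, dtype=float)
    # Rank of each value within its own column (0 = smallest).
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # Reference distribution: mean of the sorted columns, rank by rank.
    reference = np.sort(x, axis=0).mean(axis=1)
    # Replace each value by the reference value at its rank.
    return reference[ranks]


def pearson_distance(x):
    """Sample-by-sample dissimilarity, 1 - Pearson correlation,
    with samples stored in columns."""
    return 1.0 - np.corrcoef(x, rowvar=False)


# Toy data (hypothetical): 200 features measured on 6 arrays.
rng = np.random.default_rng(0)
signal = rng.normal(loc=8.0, scale=2.0, size=(200, 6))

# Hypothetical handling artifact: the last three arrays are scaled and shifted.
scale = np.array([1.0, 1.0, 1.0, 1.2, 1.2, 1.2])
shift = np.array([0.0, 0.0, 0.0, 1.5, 1.5, 1.5])
raw = signal * scale + shift

normalized = quantile_normalize(raw)
dist = pearson_distance(normalized)
```

After normalization, every column of `normalized` contains the same set of values in a sample-specific order, and `dist` is a symmetric 6-by-6 matrix with zeros on the diagonal that could be passed to a clustering routine accepting precomputed dissimilarities.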