Wu Chiung-Ting, Shen Minjie, Du Dongping, Cheng Zuolin, Parker Sarah J, Lu Yingzhou, Van Eyk Jennifer E, Yu Guoqiang, Clarke Robert, Herrington David M, Wang Yue
Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA.
Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, CA 90048, USA.
Bioinform Adv. 2022 Oct 20;2(1):vbac076. doi: 10.1093/bioadv/vbac076. eCollection 2022.
Data normalization is essential to ensure accurate inference and comparability of gene expression measures across samples or conditions. Ideally, gene expression data should be rescaled based on consistently expressed reference genes. However, when samples are biologically diverse, the most commonly used reference genes exhibit striking expression variability, and size-factor or distribution-based normalization methods can be problematic when the asymmetry in differential expression is significant.
We report an efficient and accurate data-driven method, Cosine score-based iterative normalization (Cosbin), to normalize biologically diverse samples. Based on the Cosine scores of cross-condition expression patterns, the Cosbin pipeline iteratively eliminates asymmetric differentially expressed genes, identifies consistently expressed genes, and calculates sample-wise normalization factors. We demonstrate the superior performance and enhanced utility of Cosbin compared with six representative peer methods using both simulation and real multi-omics expression datasets. Implemented in open-source R scripts and specifically designed to address normalization bias due to significant asymmetry in differential expression across multiple conditions, the Cosbin tool complements rather than replaces the existing methods and will allow biologists to more accurately detect true molecular signals among diverse phenotypic groups.
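The iterative idea described above can be illustrated with a minimal Python sketch. This is not the authors' R implementation. It is a simplified illustration under stated assumptions: the cosine score is taken between each gene's vector of group means and the all-ones reference vector, the lowest-scoring genes are dropped a fixed fraction at a time, and the sample-wise factors are total-sum factors computed on the retained genes. The function names and the parameters `drop_frac`, `keep_threshold`, `min_genes` and `n_iter` are hypothetical and do not come from the Cosbin scripts.

```python
import numpy as np


def cosine_to_reference(group_means):
    """Cosine between each gene's group-mean vector and the all-ones vector.

    A score of 1 means the gene is expressed identically in every condition.
    """
    k = group_means.shape[1]
    norms = np.linalg.norm(group_means, axis=1)
    norms = np.where(norms == 0, np.inf, norms)  # all-zero genes score 0
    return group_means.sum(axis=1) / (norms * np.sqrt(k))


def total_sum_factors(X, active):
    """Sample-wise scaling factors that equalize the totals of the active genes."""
    totals = X[active].sum(axis=0)
    return totals.mean() / totals


def cosine_iterative_normalize(X, groups, drop_frac=0.1, keep_threshold=0.99,
                               min_genes=10, n_iter=50):
    """Simplified sketch (assumption, not the published algorithm).

    X      : genes x samples matrix of non-negative expression values
    groups : condition label for each sample (column of X)
    Returns (factors, active), where `active` marks the genes retained as
    consistently expressed and `factors` rescales each sample.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    active = np.ones(X.shape[0], dtype=bool)

    for _ in range(n_iter):
        # Normalize using only the genes still considered consistent.
        Xn = X * total_sum_factors(X, active)
        means = np.column_stack(
            [Xn[:, groups == g].mean(axis=1) for g in labels]
        )
        scores = cosine_to_reference(means)

        idx = np.flatnonzero(active)
        if scores[idx].min() >= keep_threshold:
            break  # every remaining gene looks consistently expressed
        n_drop = max(1, int(drop_frac * idx.size))
        if idx.size - n_drop < min_genes:
            break  # keep enough genes to estimate the factors
        # Eliminate the genes that deviate most from the reference pattern.
        active[idx[np.argsort(scores[idx])[:n_drop]]] = False

    return total_sum_factors(X, active), active
```

In this sketch, multiplying `X` by the returned `factors` gives the normalized matrix. Because the factors are re-estimated after each elimination round, genes that are up-regulated in only some conditions gradually stop distorting the sample totals.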
The R scripts of the Cosbin pipeline are freely available at https://github.com/MinjieSh/Cosbin.
Supplementary data are available at Bioinformatics Advances online.