Junet Valentin, Farrés Judith, Mas José M, Daura Xavier
Anaxomics Biotech SL, Barcelona 08008, Spain.
Institute of Biotechnology and Biomedicine, Universitat Autònoma de Barcelona, Barcelona 08193, Spain.
Bioinformatics. 2021 Aug 25;37(16):2365-2373. doi: 10.1093/bioinformatics/btab105.
Cross-(multi)platform normalization of gene-expression microarray data remains an unresolved issue. Several algorithms exist, but each is limited in some way: some must normalize all samples from all platforms together, which compromises scalability and reuse; some are tied to the platforms of a specific provider; and others simply perform poorly. In addition, many of the methods presented in the literature have not been specifically tested on multi-platform data or compared with other methods applicable in this context. We therefore set out to develop a normalization algorithm for gene-expression studies based on multiple, potentially large microarray sets collected across multiple platforms and at different times. The algorithm is intended for systematic studies that extract knowledge from the wealth of microarray data available in public repositories; for example, for the extraction of Real-World Data to complement data from Randomized Controlled Trials. Our main performance criterion was the capacity of the algorithm to properly separate samples from different biological groups.
We present CuBlock, an algorithm addressing this objective, together with a strategy to validate cross-platform normalization methods. To validate the algorithm and benchmark it against existing methods, we used two distinct datasets: one generated specifically for testing and standardization purposes and one from an actual experimental study. Using these datasets, we benchmarked CuBlock against ComBat (Johnson et al., 2007), UPC (Piccolo et al., 2013), YuGene (Lê Cao et al., 2014), DBNorm (Meng et al., 2017), Shambhala (Borisov et al., 2019) and a simple log2 transform as reference. We note that many other popular normalization methods are not applicable in this context. In this study, CuBlock was the only algorithm in this group that could always and clearly differentiate the underlying biological groups after mixing data from up to six different platforms.
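To make the reference baseline and the performance criterion concrete, the following is a minimal Python sketch under stated assumptions. It is not CuBlock and not the validation strategy of the paper. The names `log2_reference` and `separation_score`, the pseudocount, the toy data, and the distance-ratio score are all hypothetical choices made for illustration.

```python
import numpy as np


def log2_reference(expr, pseudocount=1.0):
    """Apply a plain log2 transform to a genes x samples intensity matrix.

    The pseudocount is an assumption that avoids log2(0); it is not taken
    from the source.
    """
    return np.log2(np.asarray(expr, dtype=float) + pseudocount)


def separation_score(expr, groups):
    """Illustrative score: mean between-group distance divided by mean
    within-group distance, using Euclidean distances between samples.

    Larger values mean that samples of the same biological group lie
    closer to each other than to samples of other groups.
    """
    x = np.asarray(expr, dtype=float).T            # samples x genes
    groups = np.asarray(groups)
    diff = x[:, None, :] - x[None, :, :]           # pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=2))        # pairwise distances
    same = groups[:, None] == groups[None, :]      # same-group mask
    off_diag = ~np.eye(len(groups), dtype=bool)    # exclude self-pairs
    within = dist[same & off_diag].mean()
    between = dist[~same].mean()
    return between / within


# Toy data (assumed): 50 genes, 8 samples, two biological groups measured
# on two hypothetical platforms with different intensity scales.
rng = np.random.default_rng(0)
groups = np.array(["A", "A", "B", "B", "A", "A", "B", "B"])
platform_scale = np.array([1, 1, 1, 1, 8, 8, 8, 8], dtype=float)
signal = rng.uniform(50.0, 500.0, size=(50, 1)) * np.ones((1, 8))
signal[:10, groups == "B"] *= 4.0                  # group B up-regulates 10 genes
raw = signal * platform_scale * rng.uniform(0.9, 1.1, size=(50, 8))

print("raw :", separation_score(raw, groups))
print("log2:", separation_score(log2_reference(raw), groups))
```

Any normalization method under comparison could be substituted for `log2_reference` in the last line; the score only assumes a genes x samples matrix and one group label per sample.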
CuBlock can be downloaded from https://www.mathworks.com/matlabcentral/fileexchange/77882-cublock.
Supplementary data are available at Bioinformatics online.