Borisov Nicolas, Tkachev Victor, Simonov Alexander, Sorokin Maxim, Kim Ella, Kuzmin Denis, Karademir-Yilmaz Betul, Buzdin Anton
Omicsway Corp, Walnut, CA, United States.
Moscow Institute of Physics and Technology, Dolgoprudny, Russia.
Front Mol Biosci. 2023 Sep 6;10:1237129. doi: 10.3389/fmolb.2023.1237129. eCollection 2023.
Co-normalization of RNA profiles obtained using different experimental platforms and protocols opens an avenue for comprehensive comparison of relevant features such as differentially expressed genes associated with disease. Currently, most bioinformatic tools perform normalization in a flexible format that depends on the individual datasets under analysis. As a result, the outputs of such normalizations are poorly compatible with each other. We recently proposed a new approach to gene expression data normalization, termed Shambhala, which returns harmonized data in a uniform shape: every expression profile is transformed into a pre-defined universal format. We previously showed that after shambhalization of human RNA profiles, overall tissue-specific clustering features are strongly retained while platform-specific clustering is dramatically reduced. Here, we tested how well Shambhala retains fold-change gene expression features and other functional characteristics of gene clusters, such as pathway activation levels and predicted cancer drug activity scores. Using 6,793 cancer and 11,135 normal tissue gene expression profiles from the literature and experimental datasets, we applied twelve performance criteria to different versions of Shambhala and to other transcriptomic harmonization methods with a flexible output data format. These criteria covered biological type classifiers, hierarchical clustering, correlation/regression properties, stability of drug efficiency scores, and the suitability of the data for machine learning classifiers. The Shambhala-2 harmonizer demonstrated the best results, with correlation and linear regression coefficients close to 1 for the comparison of training vs validation datasets and less than half the instability of other methods in the calculation of drug efficiency scores.
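The correlation/regression criterion mentioned above can be illustrated with a minimal sketch. This is not the Shambhala algorithm or the authors' evaluation code. It only shows, on made-up numbers, how fold-change retention between a training and a validation dataset might be summarized by a correlation coefficient and a regression slope. The gene names, expression values, and function names are hypothetical.

```python
import math

def log2_fold_changes(case, control):
    """Per-gene log2 ratio of mean case expression to mean control expression."""
    result = {}
    for gene in case:
        mean_case = sum(case[gene]) / len(case[gene])
        mean_control = sum(control[gene]) / len(control[gene])
        result[gene] = math.log2(mean_case / mean_control)
    return result

def pearson_and_slope(x, y):
    """Pearson correlation of x and y, and least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy), sxy / sxx

# Hypothetical, already harmonized expression values (two samples per group).
train_cancer = {"GENE_A": [8.0, 12.0], "GENE_B": [2.0, 2.0], "GENE_C": [5.0, 5.0]}
train_normal = {"GENE_A": [2.0, 3.0], "GENE_B": [4.0, 4.0], "GENE_C": [5.0, 5.0]}

# Hypothetical validation set: same biology, every value scaled by a factor of 2,
# standing in for a residual platform effect shared by cancer and normal samples.
valid_cancer = {g: [2.0 * v for v in vals] for g, vals in train_cancer.items()}
valid_normal = {g: [2.0 * v for v in vals] for g, vals in train_normal.items()}

train_fc = log2_fold_changes(train_cancer, train_normal)
valid_fc = log2_fold_changes(valid_cancer, valid_normal)

genes = sorted(train_fc)
r, slope = pearson_and_slope([train_fc[g] for g in genes],
                             [valid_fc[g] for g in genes])
print(f"correlation={r:.3f}, slope={slope:.3f}")
```

In this toy case the scaling cancels out in the cancer/normal ratio, so both coefficients are close to 1, the behaviour that the criterion treats as good fold-change retention.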