Mallik Saurav, Sarkar Anasua, Nath Sagnik, Maulik Ujjwal, Das Supantha, Pati Soumen Kumar, Ghosh Soumadip, Zhao Zhongming
Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, United States.
Department of Computer Science & Engineering, Jadavpur University, Kolkata, India.
Front Genet. 2023 Feb 14;14:1095330. doi: 10.3389/fgene.2023.1095330. eCollection 2023.
Handling biomedical big data remains challenging, and integrating multi-modal data and then mining significant features (gene signature detection) is particularly difficult. With this in mind, we propose a novel framework, three-factor penalized non-negative matrix factorization-based multiple kernel learning with soft margin hinge loss (3PNMF-MKL), for multi-modal data integration followed by gene signature detection. In brief, limma, which employs empirical Bayes statistics, was first applied to each individual molecular profile to extract the statistically significant features. Three-factor penalized non-negative matrix factorization was then used for data/matrix fusion on the reduced feature sets. Multiple kernel learning models with soft margin hinge loss were deployed to estimate average accuracy scores and the area under the curve (AUC). Gene modules were identified by average linkage clustering followed by dynamic tree cut. The module with the highest correlation was considered the potential gene signature. We used an acute myeloid leukemia dataset from The Cancer Genome Atlas (TCGA) repository containing five molecular profiles. Our algorithm generated a 50-gene signature that achieved a high classification AUC of 0.827. We explored the functions of the signature genes using pathway and Gene Ontology (GO) databases. Our method outperformed the state-of-the-art methods in terms of AUC. We also included comparative studies with other related methods to strengthen the case for our method. Finally, our algorithm can be applied to any multi-modal dataset for data integration followed by gene module discovery.
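The two evaluation quantities named in the abstract, soft margin hinge loss and AUC, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `hinge_loss` and `auc` and the toy labels and scores are hypothetical, labels are assumed to be coded as -1/+1, and the multiple kernel learning classifier that would produce the decision scores is not reproduced here.

```python
from typing import Sequence


def hinge_loss(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean soft-margin hinge loss, max(0, 1 - y * f(x)), for labels in {-1, +1}."""
    if len(labels) != len(scores) or len(labels) == 0:
        raise ValueError("labels and scores must be non-empty and equally long")
    total = sum(max(0.0, 1.0 - y * s) for y, s in zip(labels, scores))
    return total / len(labels)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == -1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three positive and three negative samples
# with made-up classifier decision scores.
labels = [1, 1, 1, -1, -1, -1]
scores = [2.0, 0.4, -0.3, 0.5, -1.5, -0.8]

print("mean hinge loss:", hinge_loss(labels, scores))
print("AUC:", auc(labels, scores))
```

The pairwise definition of AUC is used here because it is short and easy to check by hand; for large sample sizes a rank-based computation would be the usual choice.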