Zhou Kaiyue, Kottoori Bhagya Shree, Munj Seeya Awadhut, Zhang Zhewei, Draghici Sorin, Arslanturk Suzan
Department of Computer Science, Wayne State University, Detroit, MI 48201, USA.
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.
Biology (Basel). 2022 Feb 24;11(3):360. doi: 10.3390/biology11030360.
Studies over the past decade have generated a wealth of molecular data that can be leveraged to better understand cancer risk, progression, and outcomes. However, because the disease is heterogeneous, data from a single modality are not sufficient to characterize progression risk or to differentiate long- and short-term survivors. A rigorously developed and tested deep-learning approach that leverages aggregate information collected from multiple repositories and multiple modalities (e.g., mRNA, DNA methylation, miRNA) could lead to more accurate and robust prediction of disease progression. Here, we propose an autoencoder-based multimodal data fusion system in which a fusion encoder flexibly integrates the collective information available across multiple studies with partially coupled data. Results from a fully controlled simulation study show that inferring the missing data through the proposed data fusion pipeline yields a predictor that outperforms baseline predictors with missing modalities. Results further show that short- and long-term survivors of glioblastoma multiforme, acute myeloid leukemia, and pancreatic adenocarcinoma can be differentiated with AUCs of 0.94, 0.75, and 0.96, respectively.
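The idea of inferring a missing modality from partially coupled data can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it swaps the deep fusion autoencoder for a linear stand-in (a PCA encoder, which spans the same subspace as the optimum of a linear autoencoder, followed by a least-squares decoder), and all names (`simulate`, `fit_fusion`, `impute_b`), dimensions, and noise levels are hypothetical choices made for illustration only.

```python
import numpy as np


def simulate(n=300, k=3, d_a=20, d_b=15, noise=0.1, seed=0):
    """Simulate two modalities driven by one shared latent factor matrix."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, k))        # shared latent state per sample
    w_a = rng.normal(size=(k, d_a))    # loading matrix for modality A
    w_b = rng.normal(size=(k, d_b))    # loading matrix for modality B
    a = z @ w_a + noise * rng.normal(size=(n, d_a))
    b = z @ w_b + noise * rng.normal(size=(n, d_b))
    return a, b


def fit_fusion(a_all, a_coupled, b_coupled, k=3):
    """Fit a linear encoder on all A samples and a decoder to B on coupled ones.

    Assumption: a linear model replaces the deep fusion encoder of the paper.
    """
    mu_a = a_all.mean(axis=0)
    # Encoder: top-k principal directions of modality A (uses every sample).
    _, _, vt = np.linalg.svd(a_all - mu_a, full_matrices=False)
    enc = vt[:k].T                     # shape (d_a, k)
    # Decoder: least-squares map from latent codes to B (coupled samples only).
    mu_b = b_coupled.mean(axis=0)
    z_coupled = (a_coupled - mu_a) @ enc
    dec_b, *_ = np.linalg.lstsq(z_coupled, b_coupled - mu_b, rcond=None)
    return mu_a, enc, mu_b, dec_b


def impute_b(a, model):
    """Infer modality B for samples where only modality A was measured."""
    mu_a, enc, mu_b, dec_b = model
    return (a - mu_a) @ enc @ dec_b + mu_b


a, b = simulate()
coupled = np.arange(100)        # samples with both modalities observed
missing = np.arange(100, 300)   # samples where B is treated as missing

model = fit_fusion(a, a[coupled], b[coupled])
b_hat = impute_b(a[missing], model)

# Compare against a naive baseline that fills in the coupled-sample mean of B.
mse_fusion = np.mean((b_hat - b[missing]) ** 2)
mse_mean = np.mean((b[coupled].mean(axis=0) - b[missing]) ** 2)
print(f"fusion imputation MSE: {mse_fusion:.4f}")
print(f"mean imputation MSE:   {mse_mean:.4f}")
```

In this toy setting the encoder is fit on every sample, including those missing modality B, while the decoder only needs the coupled subset. That mirrors, in a much simplified form, how partially coupled studies can still contribute to a shared representation.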