Kalafut Noah Cohen, Huang Xiang, Wang Daifeng
Department of Computer Sciences, Wisconsin, US.
Waisman Center, University of Wisconsin-Madison, Wisconsin, US.
Nat Mach Intell. 2023 Jun;5(6):631-642. doi: 10.1038/s42256-023-00663-z. Epub 2023 May 29.
Single-cell multimodal datasets measure various characteristics of individual cells, enabling a deep understanding of cellular and molecular mechanisms. However, generating multimodal data remains costly and challenging, and missing modalities are common. Machine learning approaches have recently been developed for data imputation, but they typically require fully matched multimodal data to learn common latent embeddings, which can lack modality specificity. To address these issues, we developed an open-source machine learning model, Joint Variational Autoencoders for multimodal Imputation and Embedding (JAMIE). JAMIE takes single-cell multimodal data in which samples may be only partially matched across modalities. Variational autoencoders learn the latent embeddings of each modality. Embeddings from matched samples across modalities are then aggregated to identify joint cross-modal latent embeddings before reconstruction. To perform cross-modal imputation, the latent embeddings of one modality can be passed to the decoder of the other modality. For interpretability, Shapley values are used to prioritize input features for cross-modal imputation and for known sample labels. We applied JAMIE to both simulated data and emerging single-cell multimodal data, including gene expression, chromatin accessibility, and electrophysiology in human and mouse brains. In general, JAMIE significantly outperforms existing state-of-the-art methods and prioritizes multimodal features for imputation, providing potentially novel mechanistic insights at cellular resolution.
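The data flow described in the abstract (per-modality encoding, aggregation of matched embeddings, reconstruction, and cross-modal imputation through the other modality's decoder) can be illustrated with a minimal, untrained sketch. Everything named here is hypothetical and is not the JAMIE API: `ToyModalityVAE`, `aggregate`, and `impute` are illustrative names, the layers are single random linear maps, and the plain average in `aggregate` is an assumption, since the abstract does not say how matched embeddings are combined.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Random weight matrix standing in for a trained layer (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Matrix-vector product using plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ToyModalityVAE:
    """Hypothetical, untrained encoder/decoder pair for one modality."""

    def __init__(self, n_features, n_latent, rng):
        self.enc_mu = make_linear(n_features, n_latent, rng)
        self.enc_logvar = make_linear(n_features, n_latent, rng)
        self.dec = make_linear(n_latent, n_features, rng)
        self.rng = rng

    def encode(self, x):
        """Return the mean and log-variance of the latent distribution."""
        return apply_linear(self.enc_mu, x), apply_linear(self.enc_logvar, x)

    def sample(self, mu, logvar):
        """Reparameterization trick: z = mu + sigma * eps."""
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]

    def decode(self, z):
        """Map a latent vector back to feature space."""
        return apply_linear(self.dec, z)


def aggregate(z_a, z_b, matched):
    """Combine latent vectors of one cell measured in two modalities.

    Assumption: matched cells share the plain average of both embeddings;
    unmatched cells keep their modality-specific embeddings unchanged.
    """
    if not matched:
        return z_a, z_b
    joint = [(a + b) / 2 for a, b in zip(z_a, z_b)]
    return joint, joint


def impute(source_vae, target_vae, x_source):
    """Cross-modal imputation: source encoder followed by target decoder."""
    mu, _ = source_vae.encode(x_source)  # use the mean embedding, no sampling
    return target_vae.decode(mu)


rng = random.Random(0)
rna = ToyModalityVAE(n_features=5, n_latent=2, rng=rng)   # e.g. gene expression
atac = ToyModalityVAE(n_features=8, n_latent=2, rng=rng)  # e.g. accessibility

x_rna = [rng.random() for _ in range(5)]
x_atac = [rng.random() for _ in range(8)]

# Training-style pass for a matched cell: encode, aggregate, reconstruct.
z_rna = rna.sample(*rna.encode(x_rna))
z_atac = atac.sample(*atac.encode(x_atac))
joint_rna, joint_atac = aggregate(z_rna, z_atac, matched=True)
recon_rna = rna.decode(joint_rna)
recon_atac = atac.decode(joint_atac)

# Imputation for a cell that has only the first modality measured.
imputed_atac = impute(rna, atac, x_rna)
```

In this sketch both modalities share the same latent dimension, which is what allows one modality's embedding to be fed to the other's decoder. A real model would train both autoencoders with reconstruction and regularization losses before the imputed values carry any meaning.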