de la Fuente Jesús, Legarra-Marcos Naroa, Diaz-Mazkiaran Aintzane, Serrano Guillermo, Marín-Goñi Irene, Sendin Markel Benito, Osta Ana García, Kalari Krishna R, Fernandez-Granda Carlos, Ochoa Idoia, Hernaez Mikel
Department of Biomedical Engineering and Science, Tecnun School of Engineering, University of Navarra, 20018 San Sebastian, Spain.
CIMA, University of Navarra, IdiSNA, 31008 Pamplona, Spain.
Nucleic Acids Res. 2025 Aug 27;53(16). doi: 10.1093/nar/gkaf821.
Deconvolution models are a powerful tool for extracting cell-type-specific information from bulk gene expression profiles. Current methods leverage advanced machine learning models and high-resolution sequencing, such as single-cell RNA sequencing, and show promising results across diverse tissues and conditions. However, they still have important limitations. First, many depend on selecting a robust reference, and that choice can strongly affect the deconvolution. Second, the pseudobulk data used for training and real bulk RNA-seq samples often exhibit strong distribution shifts, which are currently unaccounted for. Finally, most deconvolution approaches behave as black boxes, which can compromise the reliability of the results. Here, we present Sweetwater, an adaptive and interpretable autoencoder that efficiently deconvolves bulk samples by leveraging multiple classes of reference data. Moreover, we propose an improved way of generating training data from a mixture of FACS-sorted FASTQ files, reducing platform-specific biases and outperforming current single-cell-based references. Furthermore, we introduce a gold standard dataset to facilitate fair and accurate evaluation of deconvolution approaches. Finally, we demonstrate that Sweetwater adapts effectively to deconvolved samples during training, uncovering biologically meaningful patterns and improving the reliability of the results. Sweetwater is available at https://doi.org/10.6084/m9.figshare.29609180, and we anticipate it will expedite the accurate examination of high-throughput clinical data across diverse applications.
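To make the terms "pseudobulk" and "deconvolution" concrete, the following is a minimal sketch, not Sweetwater's autoencoder. It assumes a hypothetical reference matrix of per-cell-type mean expression, simulates noise-free pseudobulk samples as Dirichlet-weighted mixtures of those profiles, and recovers the proportions with a naive least-squares baseline. The function names, matrix sizes, and the gamma-distributed reference are illustrative assumptions that do not come from the paper.

```python
import numpy as np


def make_pseudobulk(reference, n_samples, rng):
    """Mix cell-type profiles into synthetic bulk samples.

    reference: (n_genes, n_cell_types) mean expression per cell type
               (hypothetical values, for illustration only).
    Returns the bulk matrix (n_samples, n_genes) and the true
    proportions (n_samples, n_cell_types).
    """
    n_types = reference.shape[1]
    # Each row is one sample's cell-type proportions and sums to 1.
    proportions = rng.dirichlet(np.ones(n_types), size=n_samples)
    bulk = proportions @ reference.T
    return bulk, proportions


def deconvolve_least_squares(bulk, reference):
    """Naive linear baseline: solve reference @ p = bulk for each sample."""
    coef, *_ = np.linalg.lstsq(reference, bulk.T, rcond=None)
    # Clip negative estimates and renormalise so each row sums to 1.
    coef = np.clip(coef.T, 0.0, None)
    return coef / coef.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
# Assumed toy reference: 200 genes, 4 cell types.
reference = rng.gamma(shape=2.0, scale=50.0, size=(200, 4))
bulk, true_props = make_pseudobulk(reference, n_samples=10, rng=rng)
est_props = deconvolve_least_squares(bulk, reference)
max_error = float(np.abs(est_props - true_props).max())
print(f"largest absolute error in estimated proportions: {max_error:.2e}")
```

Because the simulated mixtures are noise-free and drawn from the same reference used for fitting, this baseline recovers the proportions almost exactly. Real bulk samples differ from such idealised pseudobulk in platform and distribution, which is the shift the abstract describes as unaccounted for by current methods.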