Single-cell specific and interpretable machine learning models for sparse scChIP-seq data imputation.

Affiliations

Institute of Organismic and Molecular Evolution (iOME), Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany.

Institute of Molecular Biology, Mainz, Germany.

Publication information

PLoS One. 2022 Jul 1;17(7):e0270043. doi: 10.1371/journal.pone.0270043. eCollection 2022.

DOI:10.1371/journal.pone.0270043

PMID:35776722

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9249201/

Abstract

MOTIVATION

Single-cell Chromatin ImmunoPrecipitation DNA-Sequencing (scChIP-seq) analysis is challenging due to data sparsity. The high degree of sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but specific methods for scChIP-seq are lacking. We present SIMPA, a scChIP-seq data imputation method that leverages predictive information within bulk data from the ENCODE project to impute missing protein-DNA interacting regions of target histone marks or transcription factors.

RESULTS

Imputations using machine learning models trained for each single cell, each ChIP protein target, and each genomic region accurately preserve cell type clustering and improve pathway-related gene identification on real human data. Results on bulk data simulating single cells show that the imputations are single-cell specific, as the imputed profiles are closer to the simulated cell than to other cells related to the same ChIP protein target and the same cell type. Simulations also show that 100 input genomic regions are already enough to train single-cell specific models for the imputation of thousands of undetected regions. Furthermore, SIMPA enables the interpretation of machine learning models by revealing the interaction sites of a given single cell that are most important for the imputation model trained for a specific genomic region. The corresponding feature importance values derived from promoter-interaction profiles of H3K4me3, an activating histone mark, highly correlate with co-expression of genes that are present within the cell-type specific pathways in two real human and mouse datasets. SIMPA's interpretable imputation method allows users to gain a deep understanding of individual cells and, consequently, of sparse scChIP-seq datasets.
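The per-cell, per-region modelling idea described above can be illustrated with a minimal sketch. It is not the SIMPA implementation: the bulk matrix is synthetic, the random forest classifier from scikit-learn is an assumed model choice, and the names `bulk`, `observed`, `candidates` and `impute_region` are hypothetical. Bulk experiments for one ChIP target serve as training instances, the regions detected in the single cell serve as features, and one model is trained for each undetected candidate region.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for bulk reference data (assumption, not ENCODE data):
# rows are bulk experiments for one ChIP target, columns are genomic regions,
# and 1 means a peak was called in that region.
n_bulk, n_regions = 60, 40
group = rng.integers(0, 2, size=n_bulk)       # two latent cell-type groups
base = rng.random((2, n_regions))             # per-group peak probabilities
bulk = (rng.random((n_bulk, n_regions)) < base[group]).astype(int)

# Hypothetical sparse single cell: only a handful of regions were detected.
observed = np.array([0, 3, 7, 12, 20])
candidates = np.setdiff1d(np.arange(n_regions), observed)


def impute_region(bulk, observed, candidate, seed=0):
    """Train one model for one candidate region of one single cell.

    Features: presence of the cell's observed regions in each bulk experiment.
    Label: presence of the candidate region in that bulk experiment.
    Returns (probability for the single cell, feature importances).
    """
    X = bulk[:, observed]
    y = bulk[:, candidate]
    if y.min() == y.max():
        # Candidate is present in all or none of the bulk experiments,
        # so there is nothing to learn; return the constant label.
        return float(y[0]), np.zeros(len(observed))
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y)
    # By construction the single cell has every observed region present.
    cell = np.ones((1, len(observed)), dtype=int)
    positive = list(model.classes_).index(1)
    proba = model.predict_proba(cell)[0, positive]
    return float(proba), model.feature_importances_


scores, importances = {}, {}
for c in candidates:
    scores[int(c)], importances[int(c)] = impute_region(bulk, observed, c)

# Candidate regions ranked by imputed probability, highest first.
ranked = sorted(scores, key=scores.get, reverse=True)
print("top imputed regions:", ranked[:5])
print("importances for top region:", importances[ranked[0]])
```

In this sketch, the importance vector returned for a candidate region has one entry per observed region of the cell, which is the kind of quantity the abstract refers to when it speaks of interaction sites that matter most for a region-specific model.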

AVAILABILITY AND IMPLEMENTATION

Our interpretable imputation algorithm was implemented in Python and is available at https://github.com/salbrec/SIMPA.
