Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy.
Department of Computing, Imperial College London, London, UK.
BMC Bioinformatics. 2022 Apr 26;23(1):151. doi: 10.1186/s12859-022-04687-x.
Histone Mark Modifications (HMs) are crucial actors in gene regulation, as they actively remodel chromatin to modulate transcriptional activity: aberrant combinatorial patterns of HMs have been connected with several diseases, including cancer. HMs are, however, reversible modifications: understanding their role in disease would allow the design of 'epigenetic drugs' for specific, non-invasive treatments. Standard statistical techniques have not been entirely successful in extracting representative features from raw HM signals over gene locations. Deep learning approaches, on the other hand, allow for effective automatic feature extraction, but at the expense of model interpretability.
Here, we propose ShallowChrome, a novel computational pipeline to model transcriptional regulation via HMs in a way that is both accurate and interpretable. We attain state-of-the-art results on the binary classification of gene transcriptional states over 56 cell types from the REMC database, largely outperforming recent deep learning approaches. We interpret our models by extracting insightful gene-specific regulatory patterns, and we analyse them for the specific case of the PAX5 gene over three differentiated blood cell lines. Finally, we compare the patterns we obtained with the characteristic emission patterns of ChromHMM, and show that ShallowChrome is able to coherently rank groups of chromatin states with respect to their transcriptional activity.
In this work we demonstrate that it is possible to model HM-modulated gene expression regulation in a highly accurate, yet interpretable way. Our feature extraction algorithm leverages data downstream of the identification of enriched regions to retrieve gene-wise, statistically significant and dynamically located features for each HM. These features are highly predictive of gene transcriptional state, and allow for accurate modeling by computationally efficient logistic regression models. These models allow direct inspection and rigorous interpretation, helping to formulate quantifiable hypotheses.
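The modeling step described above, a logistic regression over one feature per HM whose coefficients can be read directly, can be illustrated with a minimal sketch. This is not the ShallowChrome implementation: the mark names, the synthetic data generator, the toy labelling rule and the helper names (`fit_logistic`, `make_toy_data`) are all assumptions made for illustration, and the feature extraction from enriched regions is not reproduced here.

```python
import math
import random

# Illustrative mark names only; the abstract does not list the HMs used.
HM_NAMES = ["H3K4me3", "H3K27me3", "H3K36me3"]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (transcriptionally active) or 0 (inactive)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


def make_toy_data(n=200, seed=0):
    """Synthetic genes with one hypothetical feature value per HM.

    Toy rule (an assumption, not a result from the paper): the first
    feature pushes towards the active state, the second towards the
    inactive state, and the third does not affect the label.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.uniform(0.0, 2.0) for _ in HM_NAMES]
        score = 2.0 * x[0] - 2.0 * x[1] + rng.gauss(0.0, 0.3)
        X.append(x)
        y.append(1 if score > 0 else 0)
    return X, y


X, y = make_toy_data()
w, b = fit_logistic(X, y)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)

# Direct inspection: the sign and size of each weight show how the
# corresponding feature shifts the predicted transcriptional state.
for name, wj in zip(HM_NAMES, w):
    print(f"{name}: {wj:+.2f}")
print(f"training accuracy: {accuracy:.2f}")
```

In this toy setting the fitted weight of the first feature comes out positive and that of the second negative, mirroring the rule used to generate the labels; reading a model this way is the kind of direct inspection that a linear classifier permits.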