Department of Biomedical Informatics, The Ohio State University College of Medicine, 370 W. 9th Avenue, Columbus, OH, 43210, USA.
Department of Computer Science and Engineering, The Ohio State University College of Engineering, 2015 Neil Avenue, Columbus, OH, 43210, USA.
BMC Bioinformatics. 2021 Jan 30;22(1):35. doi: 10.1186/s12859-021-03976-1.
Assigning chromatin states genome-wide (e.g., promoters and enhancers) is commonly performed to improve functional interpretation of these states. However, computational methods to assign chromatin state suffer from two drawbacks: they typically require data from multiple assays, which may not be feasible to obtain in practice, and they depend on peak-calling algorithms, which require careful parameterization and often exclude the majority of the genome. To address these drawbacks, we propose a novel learning technique built upon the Self-Organizing Map (SOM), the Self-Organizing Map with Variable Neighborhoods (SOM-VN). SOM-VN learns a set of representative shapes from a single genome-wide chromatin accessibility dataset and associates each shape with a chromatin state assignment in which a particular regulatory element (RE) is prevalent. These shapes can then be used to assign chromatin state using our workflow.
We validate the performance of the SOM-VN workflow on 14 different samples of varying quality, namely one assay each of A549 and GM12878 cell lines and two each of H1 and HeLa cell lines, primary B-cells, and brain, heart, and stomach tissue. We show that SOM-VN learns shapes that are (1) non-random, (2) associated with known chromatin states, (3) generalizable across sets of chromosomes, and (4) associated with magnitude and multimodality. We compare the accuracy of SOM-VN chromatin states against the Clustering Aggregation Tool (CAGT), an unsupervised method that learns chromatin accessibility signal shapes but does not associate these shapes with REs, and we show that overall precision and recall are higher when shapes are learned using SOM-VN than when they are learned using CAGT. We further compare enhancer state assignments from SOM-VN in signals above a set threshold to enhancer state assignments from Predicting Enhancers from ATAC-seq Data (PEAS), a deep learning method that assigns enhancer chromatin states to peaks. We show that the precision-recall area under the curve for SOM-VN enhancer state assignment is comparable to that of PEAS.
Our work shows that the SOM-VN workflow can learn relationships between REs and chromatin accessibility signal shape, which is an important step toward the goal of assigning and comparing enhancer state across multiple experiments and phenotypic states.
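For readers unfamiliar with the Self-Organizing Map on which SOM-VN builds, the following is a minimal sketch of a classic one-dimensional SOM that learns representative shapes from fixed-length signal windows. It is not the SOM-VN algorithm: the variable-neighborhood scheme is not reproduced, and the function names (`train_som`, `best_matching_unit`), the parameters, and the toy input are all hypothetical choices made for illustration.

```python
import math
import random


def best_matching_unit(units, x):
    """Return the index of the unit whose shape is closest to x (squared Euclidean)."""
    return min(
        range(len(units)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(units[j], x)),
    )


def train_som(signals, n_units=4, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a classic 1-D SOM (illustrative only) and return one shape per unit."""
    rng = random.Random(seed)
    dim = len(signals[0])
    # Initialise each unit's shape from a randomly chosen input window.
    units = [list(rng.choice(signals)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 1e-3)  # neighbourhood shrinks over time
        for x in rng.sample(signals, len(signals)):  # visit inputs in random order
            bmu = best_matching_unit(units, x)
            for j, w in enumerate(units):
                # Gaussian neighbourhood over the 1-D grid of units.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return units


if __name__ == "__main__":
    # Hypothetical toy windows: two peak-like and two valley-like signals.
    toy = [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1], [1.0, 0.0, 1.0], [0.9, 0.1, 1.0]]
    shapes = train_som(toy, n_units=3, epochs=20)
    for window in toy:
        print(window, "-> unit", best_matching_unit(shapes, window))
```

In a workflow like the one described above, each learned shape would additionally be annotated with the chromatin state that most often overlaps the windows mapped to it; that association step is omitted here.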