Department of Computer Science, Aalto University, Konemiehentie 2, Espoo, 02150, Finland.
BMC Bioinformatics. 2020 Jul 20;21(1):317. doi: 10.1186/s12859-020-03621-3.
The binding sites of transcription factors (TFs) and the localisation of histone modifications in the human genome can be quantified by the chromatin immunoprecipitation assay coupled with next-generation sequencing (ChIP-seq). The resulting chromatin feature data has been successfully adopted for genome-wide enhancer identification by several unsupervised and supervised machine learning methods. However, the current methods predict different numbers and different sets of enhancers for the same cell type and do not utilise the pattern of the ChIP-seq coverage profiles efficiently.
In this work, we propose a PRobabilistic Enhancer PRedictIoN Tool (PREPRINT) that assumes characteristic coverage patterns of chromatin features at enhancers and employs a statistical model to account for their variability. PREPRINT defines probabilistic distance measures to quantify the similarity of the genomic query regions and the characteristic coverage patterns. The probabilistic scores of the enhancer and non-enhancer samples are utilised to train a kernel-based classifier. The performance of the method is demonstrated on ENCODE data for two cell lines. The predicted enhancers are computationally validated based on the transcriptional regulatory protein binding sites and compared to the predictions obtained by state-of-the-art methods.
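As a rough illustration of the scoring idea described above, the following minimal sketch compares a query region's binned coverage with a characteristic enhancer pattern and a non-enhancer pattern. It is not the PREPRINT implementation: the abstract does not name the statistical model, so the Poisson likelihood, the pseudocount, the five-bin toy profiles, and all function names here are assumptions made purely for illustration.

```python
import math

def mean_profile(regions, pseudocount=0.5):
    """Per-bin mean coverage over training regions (a stand-in 'characteristic pattern').

    The pseudocount keeps every rate strictly positive; its value is an assumption.
    """
    n_bins = len(regions[0])
    return [sum(r[i] for r in regions) / len(regions) + pseudocount
            for i in range(n_bins)]

def poisson_loglik(counts, rates):
    """Log-likelihood of integer bin counts under independent Poisson rates (assumed model)."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1)
               for k, lam in zip(counts, rates))

def enhancer_score(query, enhancer_pattern, background_pattern):
    """Log-likelihood ratio: positive when the query resembles the enhancer pattern more."""
    return (poisson_loglik(query, enhancer_pattern)
            - poisson_loglik(query, background_pattern))

# Toy read counts per bin for one chromatin feature (made-up numbers, not ENCODE data).
enhancer_train = [[1, 4, 9, 4, 1], [2, 5, 8, 5, 2]]
background_train = [[2, 2, 3, 2, 2], [3, 2, 2, 3, 2]]

enhancer_pattern = mean_profile(enhancer_train)
background_pattern = mean_profile(background_train)

peaked_query = [1, 5, 9, 4, 2]   # coverage peaked in the centre
flat_query = [2, 3, 2, 2, 3]     # roughly uniform coverage

print(enhancer_score(peaked_query, enhancer_pattern, background_pattern))
print(enhancer_score(flat_query, enhancer_pattern, background_pattern))
```

In a pipeline like the one described, one such score per chromatin feature would form the feature vector passed to the kernel-based classifier; that training step is left out of this sketch.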
PREPRINT performs favourably compared with the state-of-the-art methods, especially when the methods are required to predict a larger set of enhancers. PREPRINT generalises successfully to data from a cell type not utilised for training, where it often performs better than the previous methods. The enhancers predicted by PREPRINT are less sensitive to the choice of prediction threshold. PREPRINT identifies biologically validated enhancers not predicted by the competing methods. The enhancers predicted by PREPRINT can aid genome interpretation in functional genomics and clinical studies.