Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy.
Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), 20139 Milan, Italy.
Bioinformatics. 2022 Feb 7;38(5):1183-1190. doi: 10.1093/bioinformatics/btab815.
Approaches such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) represent the standard for the identification of binding sites of DNA-associated proteins, including transcription factors and histone marks. Public repositories of omics data contain a huge amount of experimental ChIP-seq data, but their reuse and integrative analysis across multiple conditions remain a daunting task.
We present the Combinatorial and Semantic Analysis of Functional Elements (CombSAFE), an efficient computational method able to integrate and take advantage of the numerous and valuable, but heterogeneous, ChIP-seq data publicly available in big data repositories. Leveraging natural language processing techniques, it integrates omics data samples with semantic annotations from selected biomedical ontologies; then, using hidden Markov models, it identifies combinations of static and dynamic functional elements throughout the genome for the corresponding samples. CombSAFE supports whole-genome analysis by clustering patterns of regions with similar functional elements and by running enrichment analyses to discover the ontological terms significantly associated with them. Moreover, it allows comparing the functional states of a specific genomic region to analyze how they differ across the various semantic annotations. Such findings can provide novel insights by identifying unexpected combinations of functional elements in different biological conditions.
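To make the hidden Markov model step more concrete, the following is a minimal sketch of how a genome split into bins, each carrying a binary presence/absence vector of ChIP-seq marks, could be segmented into functional states. It is not the CombSAFE implementation: the two state names, all probabilities, the independent Bernoulli emission model, and the use of Viterbi decoding are assumptions chosen only for illustration.

```python
import math

# Hypothetical toy model (not taken from CombSAFE): 2 functional states,
# 3 binarized ChIP-seq marks observed in each genomic bin.
STATES = ["active", "repressed"]
START = [0.5, 0.5]                      # initial state probabilities
TRANS = [[0.9, 0.1],                    # TRANS[r][s]: P(next state s | current state r)
         [0.1, 0.9]]
EMIT = [[0.8, 0.7, 0.1],                # EMIT[s][m]: P(mark m present | state s)
        [0.1, 0.2, 0.8]]


def log_emission(state, obs):
    """Log-probability of a binary mark vector under independent Bernoulli emissions."""
    return sum(math.log(p if x else 1.0 - p) for p, x in zip(EMIT[state], obs))


def viterbi(observations):
    """Return the most likely state label for each bin (Viterbi decoding in log space)."""
    n = len(STATES)
    scores = [math.log(START[s]) + log_emission(s, observations[0]) for s in range(n)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev = scores
        pointers, scores = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers.append(best)
            scores.append(prev[best] + math.log(TRANS[best][s]) + log_emission(s, obs))
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(range(n), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


# Four consecutive bins; each tuple marks which of the 3 marks are present.
bins = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1)]
print(viterbi(bins))  # → ['active', 'active', 'repressed', 'repressed']
```

In a real analysis the transition and emission parameters would be learned from the data rather than fixed by hand, and the number of states would typically be larger.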
The Python implementation of the CombSAFE pipeline is freely available for non-commercial use at: https://github.com/DEIB-GECO/CombSAFE.
Supplementary data are available at Bioinformatics online.