Department of Pathology, Oslo University Hospital - Norwegian Radium Hospital, Oslo, Norway.
Department of Medical Biochemistry, Oslo University Hospital and University of Oslo, Oslo, Norway.
BMC Bioinformatics. 2022 Mar 3;23(1):83. doi: 10.1186/s12859-022-04615-z.
Transcription factor (TF) binding motifs are identified by high-throughput sequencing technologies as a means of capturing protein-DNA interactions. These motifs are often represented by consensus sequences in the form of position weight matrices (PWMs). With an ever-increasing pool of TF binding motifs from multiple sources, redundancy is difficult to avoid, especially when every source maintains its own database. One solution is to cluster biologically relevant or similar PWMs, whether they come from experimental detection or in silico prediction. However, efficient tools for clustering PWMs are lacking, and assessing the quality of PWM clusters is a further challenge. New methods and tools are therefore required to cluster PWMs efficiently and to assess the quality of the resulting clusters.
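To make the PWM representation concrete, the following is a minimal sketch of how a log-odds PWM and its consensus sequence can be derived from aligned binding sites. It is not taken from abc4pwm: the function names `build_pwm` and `consensus`, the pseudocount of 0.5, the uniform background of 0.25, and the four example sites are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites.

    Returns one dict per motif position, mapping each base to
    log2(smoothed frequency / background). The pseudocount and the
    uniform background are illustrative assumptions.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def consensus(pwm):
    """Return the highest-scoring base at each position."""
    return "".join(max(column, key=column.get) for column in pwm)


# Hypothetical aligned binding sites, invented for this example.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(sites)
motif_consensus = consensus(pwm)
print(motif_consensus)
```

The consensus keeps only the top base per position, whereas the matrix retains the weaker preferences (here the G at position 4 and the A at position 6), which is why PWMs rather than consensus strings are compared when looking for redundant motifs.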
A new Python package, Affinity Based Clustering for Position Weight Matrices (abc4pwm), was developed. It efficiently clustered PWMs from multiple sources with or without DNA-binding domain (DBD) information, generated a representative motif for each cluster, evaluated clustering quality automatically, and filtered out incorrectly clustered PWMs. In addition, it updated the human DBD family database automatically, classified known human TF PWMs into their respective DBD families, and performed TF motif searching and motif discovery using a new ensemble learning approach.
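The general idea of grouping similar PWMs and summarizing each group by a representative motif can be sketched as follows. This is not the abc4pwm algorithm or its API: a simple greedy threshold clustering stands in for the affinity-based method, the similarity measure (one minus the normalized mean column-wise Euclidean distance) is an assumption, motifs are assumed to be pre-aligned and of equal length, and the three motifs and their names are invented.

```python
import math


def similarity(p, q):
    """Similarity in [0, 1] between two equal-length frequency matrices.

    Each matrix is a list of [A, C, G, T] probability rows. The maximum
    Euclidean distance between two probability vectors is sqrt(2), so the
    mean column distance is divided by sqrt(2) before subtracting from 1.
    """
    if len(p) != len(q):
        raise ValueError("this sketch only compares motifs of equal length")
    dists = [math.dist(col_p, col_q) for col_p, col_q in zip(p, q)]
    return 1.0 - (sum(dists) / len(dists)) / math.sqrt(2)


def greedy_cluster(motifs, threshold=0.8):
    """Assign each motif to the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of names; the first name is the seed
    for name, matrix in motifs.items():
        for members in clusters:
            if similarity(motifs[members[0]], matrix) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def representative(motifs, members):
    """Position-wise mean of the member matrices."""
    length = len(motifs[members[0]])
    return [
        [sum(motifs[m][pos][b] for m in members) / len(members) for b in range(4)]
        for pos in range(length)
    ]


# Hypothetical motifs: motif_A and motif_B are near-duplicates, motif_C differs.
motifs = {
    "motif_A": [[0.8, 0.1, 0.05, 0.05], [0.1, 0.7, 0.1, 0.1], [0.05, 0.05, 0.1, 0.8]],
    "motif_B": [[0.7, 0.1, 0.1, 0.1], [0.1, 0.8, 0.05, 0.05], [0.1, 0.05, 0.05, 0.8]],
    "motif_C": [[0.05, 0.05, 0.8, 0.1], [0.7, 0.1, 0.1, 0.1], [0.1, 0.8, 0.05, 0.05]],
}

clusters = greedy_cluster(motifs)
rep = representative(motifs, clusters[0])
print(clusters)
```

A practical tool would also need to handle motifs of different lengths, alignment offsets, and reverse complements, and would need a way to score cluster quality; none of that is attempted here.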
This work demonstrates applications of abc4pwm to DNA sequence analysis of various high-throughput sequencing data, using ~1770 human TF PWMs. It recovered known TF motifs at gene promoters based on gene expression profiles (RNA-seq) and identified true TF binding targets for motifs predicted from ChIP-seq experiments. The abc4pwm package is a useful tool for TF motif searching, clustering, quality assessment, and integration in multiple types of sequence data analysis, including RNA-seq, ChIP-seq and ATAC-seq.
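As an illustration of what motif searching in a promoter sequence involves, here is a minimal sketch that slides a PWM along a sequence and reports the best-scoring window. It does not reproduce the abc4pwm search or its ensemble learning approach: the helper names, the 0.85 probability assigned to each consensus base, the uniform background, and the example sequence are assumptions, and only the forward strand is scanned.

```python
import math


def consensus_to_matrix(consensus, hit=0.85):
    """Build a toy frequency matrix that strongly prefers the consensus base."""
    miss = (1.0 - hit) / 3.0
    return [{b: (hit if b == c else miss) for b in "ACGT"} for c in consensus]


def window_score(matrix, window, background=0.25):
    """Sum of log2(frequency / background) over the window."""
    return sum(math.log2(col[base] / background) for col, base in zip(matrix, window))


def best_hit(matrix, sequence):
    """Return (start, window, score) for the best forward-strand match."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    starts = range(len(sequence) - width + 1)
    start = max(starts, key=lambda i: window_score(matrix, sequence[i:i + width]))
    window = sequence[start:start + width]
    return start, window, window_score(matrix, window)


# Hypothetical motif and promoter fragment, invented for this example.
matrix = consensus_to_matrix("TGACTCA")
start, window, score = best_hit(matrix, "GGCATGACTCATCC")
print(start, window, round(score, 2))
```

In a real analysis, the score of each hit would be compared against a threshold or a background distribution before a site is called, and both strands would be scanned.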