Tognon Manuel, Kumbara Alisa, Betti Andrea, Ruggeri Lorenzo, Giugno Rosalba
Computer Science Department, University of Verona, Strada Le Grazie 15, Verona, VR 37134, Italy.
Department of Engineering for Innovation Medicine, University of Verona, Strada Le Grazie 15, Verona, VR 37134, Italy.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf363.
Transcription factors (TFs) are essential regulatory proteins that control cellular transcriptional states by binding to specific DNA sequences known as transcription factor binding sites (TFBSs) or motifs. Accurate TFBS identification is crucial for unraveling the regulatory mechanisms driving cellular dynamics. Over the years, various computational approaches have been developed to model TFBSs, with position weight matrices (PWMs) being one of the most widely adopted. PWMs provide a probabilistic framework by representing nucleotide frequencies at every position within the binding site. While effective and interpretable, PWMs have significant limitations, such as their inability to capture dependencies between positions or to model complex interactions. To address these limitations, more advanced methods, such as support vector machine (SVM)-based and deep learning (DL)-based models, have been introduced. Leveraging human ChIP-seq data from ENCODE, we systematically benchmark the predictive performance of PWM-, SVM-, and DL-based models across different scenarios. We evaluate the impact of key factors such as training dataset size, sequence length, and kernel function (for SVMs) on model performance. Additionally, we explore the impact of using synthetic versus real biological background data during model training. Our analysis highlights the strengths and limitations of each approach under different conditions, providing practical guidance for selecting and tailoring models to specific biological datasets. To complement our analysis, we present a comprehensive database of pretrained SVM models for TFBS detection, trained on human ChIP-seq data from diverse cell lines and tissues. This resource aims to facilitate broader adoption of SVM-based methods in TFBS prediction and to enhance their practical utility in regulatory genomics research.
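The PWM idea described above can be illustrated with a minimal sketch. It is not the benchmark's implementation: the function names (`build_pwm`, `score_window`, `best_hit`), the toy binding sites, the pseudocount of 1, and the uniform 0.25 background are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned binding sites.

    Assumptions for this sketch: a uniform background frequency and a
    fixed pseudocount added to every nucleotide count.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        # Per-position nucleotide frequency, converted to a log-odds score.
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score_window(pwm, window):
    """Sum per-position scores; each position contributes independently."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Slide the PWM along the sequence and return (best_score, offset)."""
    width = len(pwm)
    return max(
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical toy sites, invented for illustration (not taken from ENCODE).
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA", "TGACTCT"]
pwm = build_pwm(toy_sites)
score, offset = best_hit(pwm, "CCGGTGACTCAGGCC")
print(f"best window starts at offset {offset} with score {score:.2f}")
```

Because `score_window` simply adds one term per position, the score of a nucleotide at one position cannot depend on its neighbors. This is the positional-independence limitation that motivates the SVM- and DL-based alternatives.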