Laboratório Nacional de Computação Científica - LNCC, Avenida Getúlio Vargas, Petrópolis, Rio de Janeiro 25651075, Brazil.
Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica del Estado de Yucatán, Carretera Sierra Papacal, Mérida 97302, Yucatán, México.
Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae581.
Transcription factors (TFs) in bacteria play a crucial role in gene regulation by binding to specific DNA sequences, thereby assisting in the activation or repression of genes. Despite their central role, deciphering shape recognition in bacterial TF-DNA interactions remains an intricate challenge. A deeper understanding of DNA secondary structures could greatly enhance our knowledge of how TFs recognize and interact with DNA, thereby elucidating their biological function. In this study, we employed machine learning algorithms to predict transcription factor binding sites (TFBS) and classify them as directed-repeat (DR) or inverted-repeat (IR). To accomplish this, we grouped the TFBS nucleotide sequences by length, from 8 to 20 base pairs, and converted them into thermodynamic data known as DNA duplex stability (DDS). Our results demonstrate that the Random Forest algorithm predicts TFBS with an average accuracy of over 82% and distinguishes between IR and DR with an accuracy of 89%. Interestingly, upon converting the base pairs of several TFBS-IR into DDS values, we observed a symmetric profile typical of the palindromic structure associated with these architectures. This study presents a novel TFBS prediction model based on a DDS characteristic that may indicate how the respective proteins interact with base pairs, thus providing insights into the molecular mechanisms underlying bacterial TF-DNA interaction.
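The symmetry observation can be illustrated with a minimal sketch. The abstract does not give the DDS table or the encoding used, so everything below is an assumption: the free-energy values are illustrative nearest-neighbor parameters (kcal/mol at 37 °C) of the kind published by SantaLucia (1998) and should be checked against a published table before real use, and the names `NN_DG`, `dds_profile`, `reverse_complement`, and `is_symmetric`, as well as the two example sequences, are hypothetical. The sketch slides a two-base window along a site and looks up one stability value per dinucleotide step. Because a dinucleotide and its reverse complement describe the same duplex step, an inverted repeat (a reverse-complement palindrome) yields a mirror-symmetric profile, while a direct repeat generally does not.

```python
# Illustrative nearest-neighbor free energies (kcal/mol, 37 degrees C).
# ASSUMPTION: these values are not taken from the abstract; verify them
# against a published nearest-neighbor table before relying on them.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def dds_profile(seq: str) -> list[float]:
    """Map a sequence of length n to n - 1 dinucleotide stability values."""
    seq = seq.upper()
    return [NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1)]


def is_symmetric(profile: list[float], tol: float = 1e-9) -> bool:
    """True if the profile reads the same forwards and backwards."""
    return all(abs(a - b) <= tol for a, b in zip(profile, reversed(profile)))


if __name__ == "__main__":
    inverted = "TGTGAGCGCTCACA"  # hypothetical site, equal to its reverse complement
    direct = "TGTGAGCTGTGAGC"    # hypothetical site, TGTGAGC repeated head to tail
    print(reverse_complement(inverted) == inverted)
    print(is_symmetric(dds_profile(inverted)))
    print(is_symmetric(dds_profile(direct)))
```

Profiles of a fixed length could then serve as numeric feature vectors for a classifier such as Random Forest, which is one plausible reason for grouping sites by length; the abstract does not specify how the features were actually built.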