Farrel Alvin, Guo Jun-Tao
Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA.
BMC Bioinformatics. 2017 Jul 17;18(1):342. doi: 10.1186/s12859-017-1755-0.
Gene expression is regulated by transcription factors binding to specific target DNA sites. Understanding how and where transcription factors bind at genome scale is an essential step toward understanding gene regulatory networks. Previously, we developed a structure-based method for predicting transcription factor binding sites using an integrative energy function that combines a knowledge-based multibody potential with two atomic energy terms. While the method performs well, it is not computationally efficient, because the number of binding sequences to be evaluated grows exponentially with the length of the binding site. In this paper, we present an efficient pentamer algorithm that splits DNA binding sequences into overlapping fragments, together with a simplified integrative energy function, for transcription factor binding site prediction.
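The scaling argument can be illustrated with simple counting. The sketch below is not taken from the paper: it assumes that a full-length search evaluates all 4^L sequences for a site of length L, while a fragment-based search evaluates all 4^5 pentamers at each of the L - 4 overlapping windows. The function names are hypothetical, and the evaluation counts in the actual method may differ.

```python
def full_length_evaluations(site_length: int) -> int:
    """Number of distinct DNA sequences of the given length (4 bases per position)."""
    return 4 ** site_length


def pentamer_evaluations(site_length: int, k: int = 5) -> int:
    """Evaluations if every k-mer is scored once at each overlapping window.

    Assumption for illustration: windows are scored independently,
    so the cost is (number of windows) * 4**k.
    """
    if site_length < k:
        raise ValueError("site must be at least k base pairs long")
    windows = site_length - k + 1
    return windows * 4 ** k


if __name__ == "__main__":
    for length in (6, 10, 14):
        print(length, full_length_evaluations(length), pentamer_evaluations(length))
```

Under these assumptions the full-length count grows exponentially with L, whereas the fragment-based count grows linearly.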
A DNA binding sequence is split into overlapping pentamers (5 base pairs) for calculating the transcription factor-pentamer interaction energy. To combine the scores from overlapping pentamers, we developed two methods, Kmer-Sum and PWM (Position Weight Matrix) stacking, for full-length binding motif prediction. Our results show that the new pentamer approach, with either Kmer-Sum or PWM stacking and the simplified integrative energy function, improved transcription factor binding site prediction accuracy and dramatically reduced computation time, especially for longer binding sites.
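The splitting step and a sum-based combination can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify how Kmer-Sum combines scores, so the sketch simply adds one energy per overlapping window. The energy tables are toy placeholders standing in for the structure-based energy function, and all function names are hypothetical.

```python
from itertools import product


def split_into_pentamers(sequence: str, k: int = 5) -> list[str]:
    """Return the overlapping k-mers of a sequence, stepping one base at a time."""
    if len(sequence) < k:
        raise ValueError("sequence must be at least k bases long")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_sum_score(sequence: str, window_energies: list[dict[str, float]], k: int = 5) -> float:
    """Sum per-window k-mer energies for a full-length sequence.

    window_energies[i] maps each k-mer to its (hypothetical) interaction
    energy at window i. Assumption: the full-length score is a plain sum.
    """
    pentamers = split_into_pentamers(sequence, k)
    if len(pentamers) != len(window_energies):
        raise ValueError("need exactly one energy table per window")
    return sum(table[p] for table, p in zip(window_energies, pentamers))


def toy_energy_table(k: int = 5) -> dict[str, float]:
    """Placeholder energies: each G contributes -1.0 (purely illustrative)."""
    return {"".join(p): -float(p.count("G")) for p in product("ACGT", repeat=k)}


if __name__ == "__main__":
    site = "ACGTACG"  # a 7 bp site has 3 overlapping pentamers
    tables = [toy_energy_table() for _ in range(len(site) - 4)]
    print(split_into_pentamers(site))
    print(kmer_sum_score(site, tables))
```

In a real pipeline each table would be filled by scoring pentamers against the transcription factor structure at that window position; the toy table only shows the data flow.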
Our new fragment-based pentamer algorithm and simplified energy function improve both efficiency and accuracy. To our knowledge, this is the first fragment-based method for structure-based transcription factor binding site prediction.