Department of Computer Science and Engineering, University of Connecticut, Fairfield Road, Storrs, CT 06269, USA.
BMC Bioinformatics. 2013 Mar 24;14:108. doi: 10.1186/1471-2105-14-108.
Scientists routinely scan DNA sequences for transcription factor (TF) binding sites (TFBSs). Most available tools rely on position-specific scoring matrices (PSSMs) constructed from aligned binding sites. Because of the limited resolution of the assays used to obtain TFBSs, databases such as TRANSFAC, ORegAnno and PAZAR store unaligned, variable-length DNA segments that contain the binding sites of a TF. These segments must be aligned before a PSSM can be built. Although the TRANSFAC database provides scoring matrices for TFs, nearly 78% of the TFs in the public release have no matrix available. Because work on TFBS alignment algorithms has been limited, an alignment algorithm tailored to TFBSs is highly desirable.
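To make the role of alignment concrete, the following minimal sketch builds a log-odds PSSM from sites that are already aligned to equal length and scans a sequence with it. It is an illustration under stated assumptions, not the method of this paper: the function names `build_pssm` and `scan`, the uniform background, the pseudocount of 1, and the example sites are all hypothetical.

```python
import math

ALPHABET = "ACGT"

def build_pssm(aligned_sites, pseudocount=1.0, background=None):
    """Build a log-odds PSSM from gap-free sites of equal length.

    Assumptions (not from the paper): uniform background frequencies
    and an additive pseudocount per base.
    """
    if background is None:
        background = {b: 0.25 for b in ALPHABET}
    width = len(aligned_sites[0])
    if any(len(s) != width for s in aligned_sites):
        raise ValueError("sites must be aligned to the same length")
    pssm = []
    for col in range(width):
        # Count each base in this column, starting from the pseudocount.
        counts = {b: pseudocount for b in ALPHABET}
        for site in aligned_sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        # Log-odds of the column frequency against the background.
        pssm.append({b: math.log2((counts[b] / total) / background[b])
                     for b in ALPHABET})
    return pssm

def scan(sequence, pssm):
    """Score every window of the sequence; return (offset, score) of the best one."""
    width = len(pssm)
    best = None
    for i in range(len(sequence) - width + 1):
        score = sum(pssm[j][sequence[i + j]] for j in range(width))
        if best is None or score > best[1]:
            best = (i, score)
    return best

# Hypothetical aligned sites, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pssm = build_pssm(sites)
offset, score = scan("GGCATGACTCATTCG", pssm)
```

The column-wise counting only makes sense when every site has the same length and the same register, which is why unaligned variable-length segments cannot be used directly.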
We designed a novel algorithm named LASAGNA, which is aware of the lengths of input TFBSs and utilizes position dependence. Results on 189 TFs from 5 species in the TRANSFAC database showed that our method significantly outperformed ClustalW2 and MEME. We further compared a PSSM method based on LASAGNA with an alignment-free TFBS search method. Results on 89 TFs whose binding sites can be located in genomes showed that our method is significantly more precise at fixed recall rates. Finally, we described LASAGNA-ChIP, a more sophisticated version for ChIP (chromatin immunoprecipitation) experiments. Under the one-per-sequence model, its performance in discovering motifs in ChIP-seq peak sequences was comparable to that of MEME.
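As a rough illustration of what "precision at a fixed recall rate" measures, the sketch below ranks scored predictions and reports precision at the first cutoff that reaches a requested recall. The evaluation protocol of the paper is not given here, so the function name `precision_at_recall`, the input format, and the example scores are assumptions made for illustration; tied scores are not treated specially.

```python
def precision_at_recall(scored_hits, n_true, recall):
    """Return precision at the highest score cutoff that reaches `recall`.

    scored_hits: list of (score, is_true_site) pairs (hypothetical format).
    n_true: total number of known true sites.
    Returns None if the requested recall is never reached.
    """
    # Rank predictions from highest to lowest score.
    ranked = sorted(scored_hits, key=lambda h: h[0], reverse=True)
    true_positives = 0
    for rank, (_, is_true) in enumerate(ranked, start=1):
        if is_true:
            true_positives += 1
        if true_positives / n_true >= recall:
            # Precision = true positives among the top `rank` predictions.
            return true_positives / rank
    return None

# Made-up scores and labels, for illustration only.
hits = [(9.1, True), (8.4, False), (7.7, True),
        (6.0, True), (5.2, False), (4.9, True)]
p_half = precision_at_recall(hits, n_true=4, recall=0.5)
p_full = precision_at_recall(hits, n_true=4, recall=1.0)
```

Fixing recall makes two search methods comparable: each is asked to recover the same fraction of known sites, and the one that needs fewer false predictions to do so is the more precise.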
We conclude that the LASAGNA algorithm is simple and effective in aligning variable-length binding sites. It has been integrated into a user-friendly webtool for TFBS search and visualization called LASAGNA-Search. The tool currently stores precomputed PSSM models for 189 TFs and 133 TFs built from TFBSs in the TRANSFAC Public database (release 7.0) and the ORegAnno database (08Nov10 dump), respectively. The webtool is available at http://biogrid.engr.uconn.edu/lasagna_search/.