Banting and Best Department of Medical Research and Donnelly Centre, University of Toronto, Toronto, Ontario, Canada.
Nat Biotechnol. 2013 Feb;31(2):126-34. doi: 10.1038/nbt.2486. Epub 2013 Jan 27.
Genomic analyses often involve scanning for potential transcription factor (TF) binding sites using models of the sequence specificity of DNA binding proteins. Many approaches have been developed to model and learn a protein's DNA-binding specificity, but these methods have not been systematically compared. Here we applied 26 such approaches to in vitro protein binding microarray data for 66 mouse TFs belonging to various families. For nine TFs, we also scored the resulting motif models on in vivo data, and found that the best in vitro-derived motifs performed similarly to motifs derived from the in vivo data. Our results indicate that simple models based on mononucleotide position weight matrices trained by the best methods perform similarly to more complex models for most TFs examined, but fall short in specific cases (<10% of the TFs examined here). In addition, the best-performing motifs typically have relatively low information content, consistent with widespread degeneracy in eukaryotic TF sequence preferences.
基因组分析通常涉及使用 DNA 结合蛋白序列特异性模型来扫描潜在的转录因子 (TF) 结合位点。已经开发了许多方法来对蛋白质的 DNA 结合特异性进行建模和学习,但这些方法尚未得到系统比较。在这里,我们应用了 26 种这样的方法对 66 种属于不同家族的小鼠 TF 的体外蛋白质结合微阵列数据进行了分析。对于 9 个 TF,我们还在体内数据上对得到的基序模型进行了评分,发现体外衍生的最佳基序模型与从体内数据得到的基序模型表现相似。我们的结果表明,对于大多数所检查的 TF,基于最佳方法训练的单核苷酸位置权重矩阵的简单模型与更复杂的模型表现相似,但在特定情况下(<10%的所检查的 TF)表现不佳。此外,表现最好的基序通常具有相对较低的信息含量,这与真核生物 TF 序列偏好中的广泛简并性一致。