Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Genetics, Washington University School of Medicine, St Louis, MO 63108, USA, Department of Biochemistry and Biology and Biotechnology, Worcester Polytechnic Institute, Worcester, MA 01609, USA, Molecular Pathology Unit, Center for Computational and Integrative Biology, and Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129, USA, Department of Molecular Medicine, University of Massachusetts Medical School, Worcester, MA 01605, USA and Department of Pathology, Harvard Medical School, Boston, MA 02115, USA.
Nucleic Acids Res. 2014 Apr;42(8):4800-12. doi: 10.1093/nar/gku132. Epub 2014 Feb 12.
Cys(2)-His(2) zinc finger proteins (ZFPs) are the largest family of transcription factors in higher metazoans. They also represent the most diverse family with regard to the composition of their recognition sequences. Although there are a number of ZFPs with characterized DNA-binding preferences, the specificity of the vast majority of ZFPs is unknown and cannot be directly inferred by homology due to the diversity of recognition residues present within individual fingers. Given the large number of unique zinc fingers and assemblies present across eukaryotes, a comprehensive predictive recognition model that could accurately estimate the DNA-binding specificity of any ZFP based on its amino acid sequence would have great utility. Toward this goal, we have used the DNA-binding specificities of 678 two-finger modules from both natural and artificial sources to construct a random forest-based predictive model for ZFP recognition. We find that our recognition model outperforms previously described determinant-based recognition models for ZFPs, and can successfully estimate the specificity of naturally occurring ZFPs with previously defined specificities.