Collado-Vides J
Department of Biology, Massachusetts Institute of Technology, Cambridge 02139.
Proc Natl Acad Sci U S A. 1992 Oct 15;89(20):9405-9. doi: 10.1073/pnas.89.20.9405.
Based on a formal proof that justifies the search for generative grammars in the study of gene regulation, a linguistic formalization of an exhaustive database of Escherichia coli sigma 70 promoters and their regulatory binding sites has been initiated. The grammar presented here generates all the arrays of the collection plus those that are predicted to be consistent with the principles of regulation of sigma 70 promoters. "Systems of regulation," sets of regulatory sites that collaborate in a mechanism of regulation, are represented by means of syntactic categories. A small set of phrase structure rules restricted by an X-bar principle and by a hierarchical, c-command relation generates a representation of arrays of sites of regulation where the selection of the protein(s) identifying the system(s) of regulation occurs. Based on the features of the proteins, optional duplicated proximal and remote sites are generated by means of transformational rules. Consistency with the data, the predictions that the grammar generates, and important similarities and differences with some aspects of the generative theory of natural language are discussed.
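To make the two-stage idea in the abstract concrete (phrase structure rules first, then an optional transformational rule that duplicates a site), here is a minimal toy sketch. Everything in it is an assumption for illustration: the rule set `RULES`, the symbol names (`ARRAY`, `SITE`, `PROMOTER`, `activator_site`, `repressor_site`), and the `duplicate_remote` transformation are hypothetical and are not the grammar, categories, or data of the paper. X-bar and c-command constraints are not modeled.

```python
from itertools import product

# Hypothetical toy phrase structure rules (NOT the rules of the paper).
# Each nonterminal maps to a list of alternative right-hand sides.
RULES = {
    "ARRAY": [["SITE", "PROMOTER"], ["PROMOTER"]],
    "SITE": [["activator_site"], ["repressor_site"]],
    "PROMOTER": [["promoter"]],
}


def expand(symbol):
    """Return every terminal sequence derivable from `symbol`."""
    if symbol not in RULES:
        # Terminals expand to themselves.
        return [(symbol,)]
    results = []
    for rhs in RULES[symbol]:
        # Expand each symbol of the right-hand side, then combine the options.
        parts = [expand(s) for s in rhs]
        for combo in product(*parts):
            results.append(tuple(tok for part in combo for tok in part))
    return results


def duplicate_remote(array, site):
    """Toy transformation: prepend a remote copy of `site` if it is present."""
    if site in array:
        return ("remote_" + site,) + array
    return array


def generate():
    """Base arrays plus the arrays obtained by the optional transformation."""
    base = expand("ARRAY")
    arrays = set(base)
    for arr in base:
        # Assumption: only the repressor site may be duplicated remotely.
        arrays.add(duplicate_remote(arr, "repressor_site"))
    return sorted(arrays)


if __name__ == "__main__":
    for arr in generate():
        print(" ".join(arr))
```

Because the transformation is optional, the output keeps both the untransformed array and its duplicated variant, which mirrors in miniature how a grammar of this kind can generate observed arrays alongside predicted ones.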