Identifying protein-binding sites from unaligned DNA fragments.

Authors

Stormo G D, Hartzell G W

Affiliation

Department of Molecular, Cellular and Developmental Biology, University of Colorado, Boulder 80309.

Publication information

Proc Natl Acad Sci U S A. 1989 Feb;86(4):1183-7. doi: 10.1073/pnas.86.4.1183.

DOI: 10.1073/pnas.86.4.1183

PMID: 2919167

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC286650/

Abstract

The ability to determine important features within DNA sequences from the sequences alone is becoming essential as large-scale sequencing projects are being undertaken. We present a method that can be applied to the problem of identifying the recognition pattern for a DNA-binding protein given only a collection of sequenced DNA fragments, each known to contain somewhere within it a binding site for that protein. Information about the position or orientation of the binding sites within those fragments is not needed. The method compares the "information content" of a large number of possible binding site alignments to arrive at a matrix representation of the binding site pattern. The specificity of the protein is represented as a matrix, rather than a consensus sequence, allowing patterns that are typical of regulatory protein-binding sites to be identified. The reliability of the method improves as the number of sequences increases, but the time required increases only linearly with the number of sequences. An example, using known cAMP receptor protein-binding sites, illustrates the method.
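
The abstract describes scoring candidate alignments by their "information content" and building the alignment up across fragments. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the standard relative-entropy definition of information content (sum over positions of f(b,i) * log2(f(b,i) / p(b))), a uniform background of 0.25 per base, and a simple greedy search that keeps a fixed number of partial alignments at each step. The function names, the `keep` parameter, and the example fragments are hypothetical, and reverse-complement orientations are ignored for brevity.

```python
import math
from collections import Counter

BASES = "ACGT"


def count_matrix(sites):
    """Per-position base counts for a list of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(width)]


def information_content(sites, background=None, pseudocount=0.0):
    """Information content of an alignment, in bits.

    Assumed definition: sum over positions i and bases b of
    f(b, i) * log2(f(b, i) / p(b)), with a uniform background by default.
    """
    background = background or {b: 0.25 for b in BASES}
    n = len(sites)
    total = 0.0
    for counts in count_matrix(sites):
        for b in BASES:
            f = (counts[b] + pseudocount * background[b]) / (n + pseudocount)
            if f > 0:
                total += f * math.log2(f / background[b])
    return total


def windows(fragment, width):
    """All contiguous substrings of the given width."""
    return [fragment[i:i + width] for i in range(len(fragment) - width + 1)]


def greedy_alignment(fragments, width, keep=10):
    """Add one fragment at a time, keeping the `keep` best partial alignments.

    Each step costs roughly keep * (windows per fragment) evaluations, so the
    total work grows linearly with the number of fragments.
    """
    candidates = [[w] for w in windows(fragments[0], width)]
    for frag in fragments[1:]:
        extended = [c + [w] for c in candidates for w in windows(frag, width)]
        extended.sort(key=information_content, reverse=True)
        candidates = extended[:keep]
    return candidates[0]


# Hypothetical fragments, each containing the pentamer TGTGA somewhere.
example_fragments = ["AATGTGACC", "CTGTGAGGA", "GGCATGTGA"]
best_sites = greedy_alignment(example_fragments, width=5)
print(best_sites, information_content(best_sites))
```

Keeping several partial alignments rather than only the single best one is a design choice in this sketch: it lets a weak early pairing be overtaken once more fragments are added, while the per-fragment cost stays bounded by `keep`.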
