Fast multiple alignment of ungapped DNA sequences using information theory and a relaxation method.

Author information

Schneider Thomas D, Mastronarde David N

Affiliation

National Cancer Institute, Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, P. O. Box B, Frederick, MD 21702-1201.

Publication information

Discrete Appl Math. 1996 Dec 1;71(1-3):259-268. doi: 10.1016/S0166-218X(96)00068-6.

PMID:19953199

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2785095/

Abstract

An information theory based multiple alignment ("Malign") method was used to align the DNA binding sequences of the OxyR and Fis proteins, whose sequence conservation is so spread out that it is difficult to identify the sites. In the algorithm described here, the information content of the sequences is used as a unique global criterion for the quality of the alignment. The algorithm uses look-up tables to avoid recalculating computationally expensive functions such as the logarithm. Because there are no arbitrary constants and because the results are reported in absolute units (bits), the best alignment can be chosen without ambiguity. Starting from randomly selected alignments, a hill-climbing algorithm can track through the immense space of s^n combinations where s is the number of sequences and n is the number of positions possible for each sequence. Instead of producing a single alignment, the algorithm is fast enough that one can afford to use many start points and to classify the solutions. Good convergence is indicated by the presence of a single well-populated solution class having higher information content than other classes. The existence of several distinct classes for the Fis protein indicates that those binding sites have self-similar features.
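
The steps named in the abstract (an information-content score in bits, a look-up table that avoids repeated logarithms, hill climbing from random starts, and classification of the resulting solutions) can be illustrated with a minimal Python sketch. This is not the Malign program itself. The function names, the fixed window width, the one-sequence-at-a-time update order, and the omission of any small-sample correction are all assumptions made for illustration.

```python
import math
import random

BASES = "ACGT"  # sequences are assumed to contain only these four letters


def make_nlogn_table(s):
    """Look-up table of n*log2(n) for n = 0..s, with 0*log2(0) taken as 0."""
    return [0.0] + [n * math.log2(n) for n in range(1, s + 1)]


def information(seqs, offsets, width, table):
    """Information (bits) of an ungapped alignment, summed over `width` columns.

    Per column: 2 - H, where H = log2(s) - (1/s) * sum_b n_b*log2(n_b)
    and n_b is the count of base b. No small-sample correction is applied.
    """
    s = len(seqs)
    log_s = math.log2(s)
    total = 0.0
    for col in range(width):
        counts = dict.fromkeys(BASES, 0)
        for seq, off in zip(seqs, offsets):
            counts[seq[off + col]] += 1
        entropy = log_s - sum(table[n] for n in counts.values()) / s
        total += 2.0 - entropy
    return total


def hill_climb(seqs, width, rng, table):
    """Shift one sequence at a time, keeping only moves that raise the score."""
    max_off = [len(q) - width for q in seqs]
    offsets = [rng.randint(0, m) for m in max_off]  # random starting alignment
    best = information(seqs, offsets, width, table)
    improved = True
    while improved:
        improved = False
        for i, m in enumerate(max_off):
            for cand in range(m + 1):
                old = offsets[i]
                if cand == old:
                    continue
                offsets[i] = cand
                score = information(seqs, offsets, width, table)
                if score > best + 1e-12:
                    best = score
                    improved = True
                else:
                    offsets[i] = old  # reject the move
    return best, tuple(offsets)


def classify_runs(seqs, width, starts=50, seed=0):
    """Run many random starts and group identical final alignments.

    Returns a list of (score_bits, offsets, count), highest score first.
    """
    rng = random.Random(seed)
    table = make_nlogn_table(len(seqs))
    classes = {}
    for _ in range(starts):
        score, offsets = hill_climb(seqs, width, rng, table)
        prev = classes.get(offsets, (score, 0))
        classes[offsets] = (score, prev[1] + 1)
    ranked = [(sc, off, cnt) for off, (sc, cnt) in classes.items()]
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked
```

Calling `classify_runs(seqs, width)` on a list of DNA strings returns the solution classes ranked by information content. In the spirit of the abstract, a single well-populated class at the top of that list would suggest good convergence, while several comparably populated classes would suggest competing alignments.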

Similar articles

Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points.

Bioinformatics. 2019 Jan 15;35(2):211-218. doi: 10.1093/bioinformatics/bty592.

Using CLUSTAL for multiple sequence alignments.

Methods Enzymol. 1996;266:383-402. doi: 10.1016/s0076-6879(96)66024-8.

SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees.

Syst Biol. 2012 Jan;61(1):90-106. doi: 10.1093/sysbio/syr095. Epub 2011 Dec 1.

From analysis of protein structural alignments toward a novel approach to align protein sequences.

Proteins. 2004 Feb 15;54(3):569-82. doi: 10.1002/prot.10503.

Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score.

BMC Bioinformatics. 2008 Dec 12;9:531. doi: 10.1186/1471-2105-9-531.

Fast model-based protein homology detection without alignment.

Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.

Genomic multiple sequence alignments: refinement using a genetic algorithm.

BMC Bioinformatics. 2005 Aug 8;6:200. doi: 10.1186/1471-2105-6-200.

MARS: improving multiple circular sequence alignment using refined sequences.

BMC Genomics. 2017 Jan 14;18(1):86. doi: 10.1186/s12864-016-3477-5.

MUSCA: An Algorithm for Constrained Alignment of Multiple Data Sequences.

Genome Inform Ser Workshop Genome Inform. 1998;9:112-119.

Cited by

Analysis of plant metabolomics data using identification-free approaches.

Appl Plant Sci. 2025 Mar 1;13(4):e70001. doi: 10.1002/aps3.70001. eCollection 2025 Jul-Aug.

Information theory tests critical predictions of plant defense theory for specialized metabolism.

Sci Adv. 2020 Jun 10;6(24):eaaz0381. doi: 10.1126/sciadv.aaz0381. eCollection 2020 Jun.

Genome-Wide Transcriptional Regulation and Chromosome Structural Arrangement by GalR in .

Front Mol Biosci. 2016 Nov 16;3:74. doi: 10.3389/fmolb.2016.00074. eCollection 2016.

Trends in information theory-based chemical structure codification.

Mol Divers. 2014 Aug;18(3):673-86. doi: 10.1007/s11030-014-9517-7. Epub 2014 Apr 5.

Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest.

BMC Bioinformatics. 2014 Mar 7;15:66. doi: 10.1186/1471-2105-15-66.

Data Compression Concepts and Algorithms and their Applications to Bioinformatics.

Entropy (Basel). 2010 Jan 1;12(1):34. doi: 10.3390/e12010034.

Discovery of novel tumor suppressor p53 response elements using information theory.

Nucleic Acids Res. 2008 Jun;36(11):3828-33. doi: 10.1093/nar/gkn189. Epub 2008 May 21.

The average mutual information profile as a genomic signature.

BMC Bioinformatics. 2008 Jan 25;9:48. doi: 10.1186/1471-2105-9-48.

Twenty Years of Delila and Molecular Information Theory: The Altenberg-Austin Workshop in Theoretical Biology Biological Information, Beyond Metaphor: Causality, Explanation, and Unification Altenberg, Austria, 11-14 July 2002.

Biol Theory. 2006;1(3):250-260. doi: 10.1162/biot.2006.1.3.250.

Discovery of Fur binding site clusters in Escherichia coli by information theory models.

Nucleic Acids Res. 2007;35(20):6762-77. doi: 10.1093/nar/gkm631. Epub 2007 Oct 5.

References

Information analysis of sequences that bind the replication initiator RepA.

J Mol Biol. 1993 Sep 20;233(2):219-30. doi: 10.1006/jmbi.1993.1501.

Sequence alignment and penalty choice. Review of concepts, case studies and implications.

J Mol Biol. 1994 Jan 7;235(1):1-12. doi: 10.1016/s0022-2836(05)80006-3.

Redox-dependent shift of OxyR-DNA contacts along an extended DNA-binding site: a mechanism for differential promoter selection.

Cell. 1994 Sep 9;78(5):897-909. doi: 10.1016/s0092-8674(94)90702-1.

A multiple sequence comparison method.

Bull Math Biol. 1993 Mar;55(2):465-86. doi: 10.1007/BF02460892.

A design for computer nucleic-acid-sequence storage, retrieval, and manipulation.

Nucleic Acids Res. 1982 May 11;10(9):3013-24. doi: 10.1093/nar/10.9.3013.

Delila system tools.

Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):129-40. doi: 10.1093/nar/12.1part1.129.

Compilation and analysis of Escherichia coli promoter DNA sequences.

Nucleic Acids Res. 1983 Apr 25;11(8):2237-55. doi: 10.1093/nar/11.8.2237.

Information content of binding sites on nucleotide sequences.

J Mol Biol. 1986 Apr 5;188(3):415-31. doi: 10.1016/0022-2836(86)90165-8.

Identifying protein-binding sites from unaligned DNA fragments.

Proc Natl Acad Sci U S A. 1989 Feb;86(4):1183-7. doi: 10.1073/pnas.86.4.1183.

CAP binding sites reveal pyrimidine-purine pattern characteristic of DNA bending.

J Biomol Struct Dyn. 1990 Oct;8(2):213-32. doi: 10.1080/07391102.1990.10507803.
