Information content of individual genetic sequences.

Author information

Schneider T D

Affiliation

National Cancer Institute, Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, P.O. Box B, Frederick, MD 21702-1201, USA.

Publication information

J Theor Biol. 1997 Dec 21;189(4):427-41. doi: 10.1006/jtbi.1997.0540.

Abstract

Related genetic sequences having a common function can be described by Shannon's information measure and depicted graphically by a sequence logo. Though useful for many purposes, sequence logos show only the average sequence conservation, and inferring the conservation of individual sequences from them is difficult. This limitation is overcome by the individual information (Ri) technique described here. The method begins by generating a weight matrix from the frequencies of each nucleotide or amino acid at each position of the aligned sequences. This matrix is then applied to the sequences themselves to determine the sequence conservation of each individual sequence. The matrix is unique because the average of these assignments is the total sequence conservation, and there is only one way to construct such a matrix. For binding sites on polynucleotides, the weight matrix has a natural cutoff that distinguishes functional sequences from other sequences. Ri values are on an absolute scale measured in bits of information, so the conservation of different biological functions can be compared with one another. The matrix can be used to rank-order the sequences, to search for new sequences, to compare sequences to other quantitative data such as binding energy or distance between binding sites, to distinguish mutations from polymorphisms, to design sequences of a given strength, and to detect errors in databases. The Ri method has been used to identify previously undescribed but experimentally verified DNA binding sites. The individual information distribution was determined for E. coli ribosome binding sites, bacterial Fis binding sites, and human donor and acceptor splice junctions, among others. The distributions demonstrate clearly that the consensus sequence is highly unusual, and hence is a poor method to describe naturally occurring binding sites.
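The construction summarized in the abstract can be illustrated with a minimal Python sketch. It assumes DNA sites (four bases, so a maximum of 2 bits per position), takes each weight to be 2 + log2 of the observed base frequency, and omits the small-sample correction used in practice. The function names and the four toy sites are hypothetical and are not taken from the paper; a base never seen at a position is simply given a weight of minus infinity here.

```python
import math
from collections import Counter

BASES = "ACGT"

def weight_matrix(sites):
    """Per-position weights 2 + log2 f(b, l) from aligned, equal-length DNA sites.

    Assumption: no small-sample correction; unseen bases get -inf.
    """
    n = len(sites)
    matrix = []
    for l in range(len(sites[0])):
        counts = Counter(site[l] for site in sites)
        column = {}
        for b in BASES:
            f = counts[b] / n
            column[b] = 2.0 + math.log2(f) if f > 0 else float("-inf")
        matrix.append(column)
    return matrix

def individual_information(matrix, seq):
    """Ri of one sequence: sum of the weights selected by its bases (bits)."""
    return sum(column[b] for column, b in zip(matrix, seq))

def total_information(sites):
    """Average conservation: sum over positions of (2 - Shannon uncertainty)."""
    n = len(sites)
    total = 0.0
    for l in range(len(sites[0])):
        counts = Counter(site[l] for site in sites)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - h
    return total

# Hypothetical toy alignment, for illustration only.
sites = ["ACGT", "ACGA", "ACCT", "TCGT"]
matrix = weight_matrix(sites)
ri_values = [individual_information(matrix, s) for s in sites]
mean_ri = sum(ri_values) / len(ri_values)

for site, ri in sorted(zip(sites, ri_values), key=lambda pair: -pair[1]):
    print(f"{site}  Ri = {ri:.3f} bits")
print(f"mean Ri = {mean_ri:.3f} bits, total = {total_information(sites):.3f} bits")
```

In this sketch the mean of the individual Ri values equals the total conservation computed directly from the position-wise uncertainties, which is the averaging property the abstract gives as the reason the matrix is unique. Sorting by Ri gives the rank-ordering use mentioned in the abstract, and the same matrix could be slid along a longer sequence to search for candidate sites.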

