Amfoh K K, Shaw R F, Bonney G E
Division of Biostatistics, Fox Chase Cancer Center, Philadelphia, Pennsylvania 19111.
Biometrics. 1994 Dec;50(4):1054-63.
We present the development of a regressive logistic model for analyzing the codon frequencies of DNA sequences in terms of explanatory variables. A codon is a triplet of nucleotides that codes for an amino acid, and may be considered as a trivariate response (B1, B2, B3), where Bi (i = 1, 2, 3) is a categorical random variable with values A, C, G, T. The linear order of bases in the DNA and the possible statistical dependence among the bases in a given codon make the regressive logistic model a suitable tool for the analysis of codon frequencies. A problem of structural zeros arises because the stop codons (terminators) do not code for amino acids; this is solved by normalizing the likelihood function. Codon frequencies may also depend on the function of the gene, and they are known to differ between genes of the same genome. Differences also occur between synonymous codons for the same amino acid. Thus, both covariates that differ between synonymous codons and covariates that are constant across the codons of the same amino acid may be useful in explaining the frequencies. As an illustration, the method is applied to the human mitochondrial genome using the following explanatory variables: (1) TSCORE, a measure of the number of single-base mutations required for a given codon to become a terminator; (2) AARISK, an indicator of a codon's ability to change, by a single base substitution, to triplets coding for amino acids with very different characteristics; (3) AVDIST, a measure of the typicality of the amino acid coded for by the triplet. The results indicate that models incorporating both dependency structure and covariates are to be preferred to models comprising either covariates alone or dependency structure alone.
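The normalization step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a simple chain-rule parameterization P(B1) P(B2 | B1) P(B3 | B1, B2) with multinomial-logit terms and no covariates. The parameter names (`a`, `b`, `c1`, `c2`) and function names are hypothetical. The stop codons are assumed to be those of the vertebrate mitochondrial genetic code (TAA, TAG, AGA, AGG). The helper `min_mutations_to_stop` is only a plausible stand-in for the idea behind TSCORE, not its published definition.

```python
import itertools
import math

BASES = "ACGT"
# Assumption: stop codons of the vertebrate mitochondrial genetic code.
STOP_CODONS = {"TAA", "TAG", "AGA", "AGG"}


def softmax(logits):
    """Convert a list of logits into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def chain_prob(codon, a, b, c1, c2):
    """P(B1) * P(B2 | B1) * P(B3 | B1, B2) over all 64 triplets.

    a: 4 logits for B1; b[i]: 4 logits for B2 given B1 = i;
    c1[i], c2[j]: additive logit contributions to B3 from B1 = i, B2 = j.
    All parameter names are hypothetical.
    """
    i, j, k = (BASES.index(x) for x in codon)
    p1 = softmax(a)[i]
    p2 = softmax(b[i])[j]
    p3 = softmax([c1[i][m] + c2[j][m] for m in range(4)])[k]
    return p1 * p2 * p3


def sense_codon_probs(a, b, c1, c2):
    """Drop stop codons (structural zeros) and renormalize the rest."""
    raw = {}
    for triple in itertools.product(BASES, repeat=3):
        codon = "".join(triple)
        if codon not in STOP_CODONS:
            raw[codon] = chain_prob(codon, a, b, c1, c2)
    z = sum(raw.values())  # probability mass left on sense codons
    return {codon: p / z for codon, p in raw.items()}


def log_likelihood(counts, probs):
    """Multinomial log-likelihood (up to a constant) of codon counts."""
    return sum(n * math.log(probs[codon]) for codon, n in counts.items())


def min_mutations_to_stop(codon):
    """Fewest single-base changes turning a codon into a stop codon."""
    return min(
        sum(x != y for x, y in zip(codon, stop)) for stop in STOP_CODONS
    )


# With all parameters at zero, every sense codon is equally likely.
zeros4 = [0.0] * 4
zeros44 = [[0.0] * 4 for _ in range(4)]
probs = sense_codon_probs(zeros4, zeros44, zeros44, zeros44)
```

With all parameters set to zero, the 60 sense codons each receive probability 1/60 and the four stop codons are absent from the result. Covariates such as those named in the abstract would enter as additional terms in the logits.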