Agosti D, Jacobs D, DeSalle R
Department of Entomology, American Museum of Natural History, New York 10024, USA.
Cladistics. 1996;12:65-82. doi: 10.1111/j.1096-0031.1996.tb00193.x.
Amino acid encoding genes contain character state information that may be useful for phylogenetic analysis on at least two levels. The nucleotide sequence and the translated amino acid sequence have both been employed separately as character states for cladistic studies of various taxa, including studies of the genealogy of genes in multigene families. In essence, amino acid sequences and nucleic acid sequences are two different ways of character coding the information in a gene. Silent positions in the nucleotide sequence (first or third positions in codons that can accrue change without changing the identity of the amino acid that the triplet codes for) may accrue change relatively rapidly and become saturated, losing the pattern of historical divergence. On the other hand, non-silent nucleotide alterations and their accompanying amino acid changes may evolve too slowly to reveal relationships among closely related taxa. In general, the dynamics of sequence change in silent and non-silent positions in protein coding genes result in homoplasy and lack of resolution, respectively. We suggest that combining nucleic acid and translated amino acid character states in the same data matrix for phylogenetic analysis addresses some of the problems caused by the rapid change of silent nucleotide positions and the overall slow rate of change of non-silent nucleotide positions and slowly changing amino acid positions. One major theoretical problem with this approach is the apparent non-independence of the two sources of characters. However, there are at least three possible outcomes when comparing protein coding nucleic acid sequences with their translated amino acids in a phylogenetic context on a codon by codon basis. First, the two character sets for a codon may be entirely congruent with respect to the information they convey about the relationships of a certain set of taxa. Second, one character set may display no information concerning a phylogenetic hypothesis while the other character set may impart information relevant to a hypothesis. These two possibilities are cases of non-independence; however, we argue that congruence in such cases can be thought of as increasing the weight of the particular phylogenetic hypothesis that is supported by those characters. In the third case, the two sources of character information for a particular codon may be entirely incongruent with respect to phylogenetic hypotheses concerning the taxa examined. In this last case the two character sets are independent in that information from neither can predict the character states of the other. Examples of these possibilities are discussed, and the general applicability of combining these two sources of information for protein coding genes is presented using sequences from the homeobox region of 46 homeobox genes from Drosophila melanogaster to develop a hypothesis of genealogical relationship of these genes in this large multigene family.
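The idea of coding one gene twice in a single data matrix can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names (`translate`, `combined_matrix`, `classify_codon`), the toy sequences, and the use of the standard genetic code are assumptions made for illustration only. The per-codon classification is a simple variability check, not the tree-based congruence comparison the abstract describes.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate an in-frame nucleotide sequence; unknown codons become '?'."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "?") for i in range(0, len(seq), 3))


def combined_matrix(seqs):
    """Return {taxon: characters}: nucleotide states followed by amino acid states."""
    if len({len(s) for s in seqs.values()}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    return {name: list(s.upper()) + list(translate(s)) for name, s in seqs.items()}


def classify_codon(seqs, codon_index):
    """Hypothetical helper: describe which character set varies at one codon."""
    start = 3 * codon_index
    codons = {s.upper()[start:start + 3] for s in seqs.values()}
    residues = {CODON_TABLE.get(c, "?") for c in codons}
    if len(codons) == 1:
        return "invariant"          # neither coding carries information
    if len(residues) == 1:
        return "nucleotide-only"    # silent change: amino acid coding is uninformative
    return "both"                   # both codings vary; congruence needs a tree to judge


# Toy alignment (made-up sequences, two codons each).
toy = {"geneA": "ATGTTA", "geneB": "ATGCTG", "geneC": "ATGTTT"}
matrix = combined_matrix(toy)
```

Here each row of `matrix` holds six nucleotide characters followed by two amino acid characters, so a parsimony program would see both codings of every codon side by side. Whether the two codings of a given codon agree, or conflict, with respect to a phylogenetic hypothesis can only be assessed on a tree, which this sketch does not attempt.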