Sarkar I Neil, Thornton Joseph W, Planet Paul J, Figurski David H, Schierwater Bernd, DeSalle Rob
Department of Medical Informatics, Columbia University College of Physicians and Surgeons, New York, NY, USA.
Mol Phylogenet Evol. 2002 Sep;24(3):388-99. doi: 10.1016/s1055-7903(02)00259-2.
When novel gene sequences are discovered, they are usually identified, classified, and annotated based on aggregate measures of sequence similarity. This method is prone to errors, however. Phylogenetic analysis is a more accurate basis for gene classification and ortholog identification, but it is relatively labor-intensive and computationally demanding. Here we report and demonstrate a rapid new method for gene classification based on phylogenetic principles. Given the phylogeny of a minimal sample of gene family members, our method automatically identifies amino acids that are phylogenetically characteristic of each class of sequences in the family; it then classifies a novel sequence based on the presence of these characteristic attributes in its sequence. Using a subset of homeobox protein sequences as a test case, we show that our method approximates classification based on full-scale phylogenetic analysis with very high accuracy in a tiny fraction of the time.
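The two steps named in the abstract, identifying characteristic amino acids for each class and then classifying a novel sequence by their presence, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the sequences are already aligned and that class labels have been read off the reference phylogeny, and it replaces any tree-based analysis with a flat rule: a residue at a given position counts as characteristic when every member of the class has it and no sequence outside the class does. The function names, the scoring rule, and the toy sequences are all hypothetical and are not homeobox data.

```python
from collections import defaultdict


def characteristic_attributes(aligned, labels):
    """Map each class to its set of (position, residue) pairs.

    Assumed rule (a simplification): a pair is characteristic of a class
    when all members of the class share that residue at that position and
    no sequence outside the class has it there.
    """
    by_class = defaultdict(list)
    for seq, label in zip(aligned, labels):
        by_class[label].append(seq)

    length = len(aligned[0])
    attrs = {}
    for label, members in by_class.items():
        others = [s for s, l in zip(aligned, labels) if l != label]
        found = set()
        for pos in range(length):
            residues = {s[pos] for s in members}
            if len(residues) != 1:
                continue  # position varies within the class
            residue = next(iter(residues))
            if residue == "-":
                continue  # ignore alignment gaps
            if all(s[pos] != residue for s in others):
                found.add((pos, residue))
        attrs[label] = found
    return attrs


def classify(seq, attrs):
    """Return the best-matching class and the per-class match fractions."""
    scores = {}
    for label, found in attrs.items():
        hits = sum(1 for pos, residue in found if seq[pos] == residue)
        scores[label] = hits / len(found) if found else 0.0
    best = max(scores, key=scores.get)
    return best, scores


# Toy aligned sequences (made up for illustration) with two classes.
reference = ["MKRAL", "MKRGL", "MTQAL", "MTQAV"]
classes = ["A", "A", "B", "B"]

attrs = characteristic_attributes(reference, classes)
predicted, scores = classify("MKRAV", attrs)
print(attrs)
print(predicted, scores)
```

Once the characteristic attributes are computed from the reference set, classifying a new sequence needs only a lookup per attribute, with no tree search. That is one plausible reading of why such an approach could run much faster than a full phylogenetic analysis.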