Universal Features for the Classification of Coding and Non-coding DNA Sequences

Authors

Carels Nicolas, Vidal Ramon, Frías Diego

Affiliation

Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil.

Publication

Bioinform Biol Insights. 2009 Jun 3;3:37-49. doi: 10.4137/bbi.s2236.

DOI:10.4137/bbi.s2236

PMID:20140069

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2808180/

Abstract

In this report, we revisited simple features that allow the classification of coding sequences (CDS) from non-coding DNA. The spectrum of codon usage of our sequence sample is large and suggests that these features are universal. The features that we investigated combine (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine, Guanine, Adenine probabilities in 1st, 2nd, 3rd position of triplets, respectively, (iv) the product of G and C probabilities in 1st and 2nd position of triplets. These features are a natural consequence of the physico-chemical properties of proteins and their combination is successful in classifying CDS and non-coding DNA (introns) with a success rate >95% above 350 bp. The coding strand and coding frame are implicitly deduced when the sequences are classified as coding.
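The four features listed in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It makes several assumptions: the "stop codon distribution" is reduced to the fraction of in-frame stop codons, the standard genetic code (TAA, TAG, TGA) is used, probabilities are estimated as simple per-position base frequencies, and feature (iv) is read as G at position 1 times C at position 2. The function and key names (`frame_features`, `purine_product`, and so on) are hypothetical.

```python
from collections import Counter

# Stop codons of the standard genetic code (assumption: no alternative code).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frame_features(seq, frame=0):
    """Compute the four abstract-level features for one reading frame.

    Probabilities are estimated as base frequencies at each codon position.
    Ambiguous bases (e.g. N) contribute to the codon count but to no
    A/C/G/T frequency.
    """
    seq = seq.upper()
    # Split the sequence into complete triplets starting at `frame`.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    n = len(codons)
    if n == 0:
        raise ValueError("sequence too short for the requested frame")

    # Count bases separately at codon positions 1, 2 and 3.
    counts = [Counter(), Counter(), Counter()]
    for codon in codons:
        for pos, base in enumerate(codon):
            counts[pos][base] += 1
    freq = [{b: counts[p][b] / n for b in "ACGT"} for p in range(3)]

    # Purine (A or G) frequency at each of the three positions.
    purine = [freq[p]["A"] + freq[p]["G"] for p in range(3)]

    return {
        # (i) simplified stand-in for the stop codon distribution
        "stop_fraction": sum(c in STOP_CODONS for c in codons) / n,
        # (ii) product of purine probabilities over the three positions
        "purine_product": purine[0] * purine[1] * purine[2],
        # (iii) C at position 1, G at position 2, A at position 3
        "c1_g2_a3_product": freq[0]["C"] * freq[1]["G"] * freq[2]["A"],
        # (iv) G at position 1, C at position 2 (assumed reading)
        "g1_c2_product": freq[0]["G"] * freq[1]["C"],
    }


def forward_frames(seq):
    """Return the feature dictionaries for the three forward frames."""
    return [frame_features(seq, frame) for frame in range(3)]
```

Calling `forward_frames(seq)` on a sequence and on its reverse complement yields six feature sets. Comparing them is one way to see how the coding strand and frame could follow from the classification, as the abstract states. The actual decision rule and thresholds are given in the paper, not here.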
