Zhang Jiexin, Zhang Li, Coombes Kevin R
Department of Biostatistics and Applied Mathematics, The University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Boulevard, Box 447, Houston, TX 77030-4009, USA.
Bioinformatics. 2006 Feb 15;22(4):385-91. doi: 10.1093/bioinformatics/bti796. Epub 2005 Dec 8.
In the post-genomic era, developing tools to decode biological information from genomic sequences is important. Inspired by affiliation network theory, we investigated gene sequences of two kinds of UniGene clusters (UCs): narrowly expressed transcripts (NETs), whose expression is confined to a few tissues; and prevalently expressed transcripts (PETs) that are expressed in many tissues.
We explored the human and the mouse UniGene databases to compare NETs and PETs from different perspectives. We found that NETs were associated with smaller cluster size, shorter sequence length, a lower likelihood of having LocusLink annotations, and lower and more sporadic levels of expression. Significantly, the dinucleotide frequencies of NETs are similar to those of intergenic sequences in the genome, and they differ from those of PETs. We used these differences in dinucleotide frequencies to develop a discriminant analysis model to distinguish PETs from intergenic sequences.
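The dinucleotide frequencies that feed the discriminant model can be illustrated with a minimal sketch. The abstract does not say how the frequencies were computed, so the choices here are assumptions: overlapping two-base windows, a single strand, and windows containing ambiguous bases (such as N) skipped. The function name `dinucleotide_frequencies` is hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# All 16 possible dinucleotides, in a fixed order (AA, AC, ..., TT).
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]


def dinucleotide_frequencies(sequence):
    """Return the relative frequency of each of the 16 dinucleotides.

    Assumptions (not stated in the abstract): overlapping windows,
    one strand only, and windows with non-ACGT characters ignored.
    """
    seq = sequence.upper()
    # Count every overlapping window of length 2.
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Only unambiguous dinucleotides contribute to the denominator.
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        return {d: 0.0 for d in DINUCLEOTIDES}
    return {d: counts[d] / total for d in DINUCLEOTIDES}
```

Each sequence then becomes a 16-value feature vector, which is the kind of input a discriminant analysis could use to separate PET-like sequences from intergenic-like ones.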
Our results show that most NETs resemble intergenic sequences, casting doubt on the quality of such UniGene clusters. However, we also noted that a fraction of NETs resemble PETs in terms of dinucleotide frequencies and other features. Such NETs may have fewer quality problems. This work may be helpful in studies of non-coding RNAs and in the validation of gene sequence databases.