Deutsch M, Long M
Department of Ecology and Evolution, The University of Chicago, 1101 East 57th Street, Chicago, IL 60637, USA.
Nucleic Acids Res. 1999 Aug 1;27(15):3219-28. doi: 10.1093/nar/27.15.3219.
To investigate the distribution of intron-exon structures of eukaryotic genes, we have constructed a general exon database comprising all available intron-containing genes and exon databases from 10 eukaryotic model organisms: Homo sapiens, Mus musculus, Gallus gallus, Rattus norvegicus, Arabidopsis thaliana, Zea mays, Schizosaccharomyces pombe, Aspergillus, Caenorhabditis elegans and Drosophila. We purged redundant genes to avoid the possible bias brought about by redundancy in the databases. After discarding questionable introns that do not contain correct splice sites, the final database contained 17 102 introns, 21 019 exons and 2903 independent or quasi-independent genes. On average, a eukaryotic gene contains 3.7 introns per kb of protein-coding region. The exon length distribution peaks around 30-40 residues, and most introns are 40-125 nt long. The variable intron-exon structures of the 10 model organisms reveal two interesting statistical phenomena, which cast light on some previous speculations. (i) Genome size seems to be correlated with total intron length per gene. For example, invertebrate introns are smaller than those of human genes, while yeast introns are shorter than invertebrate introns. However, this correlation is weak, suggesting that factors other than genome size may also affect intron size. (ii) Introns smaller than 50 nt are significantly less frequent than longer introns, possibly resulting from a minimum intron size requirement for intron splicing.
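The intron filtering and density statistic described in the abstract can be illustrated with a minimal sketch. The abstract does not say how "correct splice sites" were defined or how the per-kb average was formed, so both choices below are assumptions: introns are kept only if they follow the canonical GT...AG rule, and the density is a pooled ratio of retained introns to total coding length. The function names and the toy sequences are hypothetical and are not taken from the paper.

```python
def has_canonical_splice_sites(intron_seq: str) -> bool:
    """Return True if the intron starts with GT and ends with AG.

    Assumption: the canonical GT...AG rule stands in for the paper's
    unspecified "correct splice sites" criterion.
    """
    seq = intron_seq.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def introns_per_kb(genes) -> float:
    """Pooled intron density: retained introns per 1000 nt of coding sequence.

    `genes` is an iterable of (cds_length_nt, [intron_sequences]) pairs.
    Assumption: a pooled ratio is used; a mean of per-gene densities
    would be an equally plausible reading of the abstract.
    """
    total_introns = 0
    total_cds_nt = 0
    for cds_length_nt, intron_seqs in genes:
        # Count only introns that pass the splice-site filter.
        total_introns += sum(
            1 for seq in intron_seqs if has_canonical_splice_sites(seq)
        )
        total_cds_nt += cds_length_nt
    if total_cds_nt == 0:
        raise ValueError("total coding length must be positive")
    return 1000.0 * total_introns / total_cds_nt


# Toy data (invented for illustration): two genes, three candidate introns,
# one of which fails the GT...AG filter.
toy_genes = [
    (1000, ["GTAAGTAG", "CTAAAG"]),
    (500, ["GTCCCAG"]),
]
density = introns_per_kb(toy_genes)
```

With real data, each gene's coding length and intron sequences would come from the annotated database entries; the toy values here only exercise the filter and the ratio.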