Xiao Yong-Li, Malik Mukesh, Whitelaw Catherine A, Town Christopher D
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA.
Plant Physiol. 2002 Dec;130(4):2118-28. doi: 10.1104/pp.010207.
About 25% of the genes in the fully sequenced and annotated Arabidopsis genome have structures that are predicted solely by computer algorithms, with no support from nucleic acid or protein homologs in other species or from expressed sequence matches in Arabidopsis. These are referred to as "hypothetical genes." On chromosome 2, sequenced by The Institute for Genomic Research, there are approximately 800 hypothetical genes among a total of approximately 4,100 genes. To test their expression under various growth conditions and in specific tissues, we used six cDNA populations prepared from cold-treated, heat-treated, and pathogen (Xanthomonas campestris pv campestris)-infected plants, callus, roots, and young seedlings. To date, 169 hypothetical genes have been tested, and 138 of them were found to be expressed in one or more of the six cDNA populations. By sequencing multiple clones from each 5'- and 3'-rapid amplification of cDNA ends (RACE) product and assembling the sequences, we generated full-length sequences for 16 of these genes. For 14 genes, there was one full-length assembly that precisely supported the intron-exon boundaries of their gene predictions, adding only 5'- and 3'-untranslated region sequences. However, for three of these 14 genes, the other assemblies represent additional exons and alternatively spliced or unspliced introns. For the remaining two genes, the cDNA sequences reveal major differences from the predicted gene structures. In addition, a total of six genes displayed more than one polyadenylation site. These data will be used to update gene models in The Institute for Genomic Research annotation database ATH1.
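The counts reported above can be turned into proportions with a few lines of arithmetic. The following is a minimal sketch, not part of the original study: the `percent` helper and the dictionary keys are illustrative names, and the chromosome 2 figures are the approximate values quoted in the abstract.

```python
# Counts quoted in the abstract. The chromosome 2 totals are approximate
# ("approximately 800" and "approximately 4,100"), so derived values are rough.
counts = {
    "chr2_genes": 4100,
    "chr2_hypothetical": 800,
    "tested": 169,
    "expressed": 138,
    "full_length": 16,
    "prediction_supported": 14,
    "multiple_polyA_sites": 6,
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


summary = {
    "hypothetical genes on chromosome 2": percent(
        counts["chr2_hypothetical"], counts["chr2_genes"]
    ),
    "tested genes found expressed": percent(counts["expressed"], counts["tested"]),
    "full-length genes matching the prediction": percent(
        counts["prediction_supported"], counts["full_length"]
    ),
    "full-length genes with more than one polyadenylation site": percent(
        counts["multiple_polyA_sites"], counts["full_length"]
    ),
}

for label, value in summary.items():
    print(f"{label}: {value}%")
```

Under these figures, roughly one in five chromosome 2 genes is hypothetical, and about four in five of the tested hypothetical genes showed expression in at least one cDNA population.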