Stapleton Mark, Carlson Joe, Brokstein Peter, Yu Charles, Champe Mark, George Reed, Guarin Hannibal, Kronmiller Brent, Pacleb Joanne, Park Soo, Wan Ken, Rubin Gerald M, Celniker Susan E
Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Genome Biol. 2002;3(12):RESEARCH0080. doi: 10.1186/gb-2002-3-12-research0080. Epub 2002 Dec 23.
A collection of sequenced full-length cDNAs is an important resource both for functional genomics studies and for the determination of the intron-exon structure of genes. Providing this resource to the Drosophila melanogaster research community has been a long-term goal of the Berkeley Drosophila Genome Project. We have previously described the Drosophila Gene Collection (DGC), a set of putative full-length cDNAs that was produced by generating and analyzing over 250,000 expressed sequence tags (ESTs) derived from a variety of tissues and developmental stages.
We have generated high-quality full-insert sequence for 8,921 clones in the DGC. We compared the sequence of these clones to the annotated Release 3 genomic sequence, and identified more than 5,300 cDNAs that contain a complete and accurate protein-coding sequence. This corresponds to at least one splice form for 40% of the predicted D. melanogaster genes. We also identified potential new cases of RNA editing.
We show that comparing cDNA sequences to a high-quality annotated genomic sequence is an effective approach to identifying and eliminating defective clones from a cDNA collection, thereby ensuring its utility for experimentation. Clones were eliminated either because they carry single-nucleotide discrepancies, which most probably result from reverse transcriptase errors, or because they are truncated and contain only part of the protein-coding sequence.
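The two elimination criteria can be illustrated with a minimal sketch. It is not the pipeline used in this work: the function name `classify_clone`, the `CloneStatus` categories, and the rule that any clone shorter than the annotation counts as truncated are assumptions made for illustration. The sketch also assumes the clone's coding sequence has already been extracted and aligned to the annotated coding sequence, which in practice requires a spliced alignment against the genome.

```python
from enum import Enum


class CloneStatus(Enum):
    """Hypothetical outcome categories for a sequenced cDNA clone."""
    COMPLETE = "complete and accurate coding sequence"
    DISCREPANT = "nucleotide discrepancy relative to the annotation"
    TRUNCATED = "only part of the coding sequence present"


def classify_clone(clone_cds: str, annotated_cds: str) -> CloneStatus:
    """Compare a clone's coding sequence with the annotated coding sequence.

    Simplifying assumption: both inputs are plain nucleotide strings that
    cover the coding region only, so a string comparison is meaningful.
    """
    clone = clone_cds.upper()
    reference = annotated_cds.upper()

    if clone == reference:
        return CloneStatus.COMPLETE
    # A clone shorter than the annotated coding sequence is treated as truncated.
    if len(clone) < len(reference):
        return CloneStatus.TRUNCATED
    # Anything else differs from the annotation at one or more positions.
    return CloneStatus.DISCREPANT


if __name__ == "__main__":
    annotated = "ATGGCCTAA"  # toy annotated coding sequence, not real data
    for name, seq in [("intact", "ATGGCCTAA"),
                      ("substituted", "ATGGCATAA"),
                      ("short", "GCCTAA")]:
        print(name, "->", classify_clone(seq, annotated).value)
```

Clones classified as anything other than `COMPLETE` would be the candidates for removal from the collection. A real comparison would also have to separate reverse transcriptase errors from genuine polymorphisms and from the potential RNA editing sites mentioned above, which this sketch does not attempt.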