Division of Biology, California Institute of Technology, Pasadena, California 91125, USA.
Genome Res. 2010 Dec;20(12):1740-7. doi: 10.1101/gr.111021.110. Epub 2010 Oct 27.
Efficient sequencing of animal and plant genomes by next-generation technology should allow many neglected organisms of biological and medical importance to be better understood. As a test case, we have assembled a draft genome of Caenorhabditis sp. 3 PS1010 through a combination of direct sequencing and scaffolding with RNA-seq. We first sequenced genomic DNA and mixed-stage cDNA using paired 75-nt reads from an Illumina GAII. A set of 230 million genomic reads yielded an 80-Mb assembly, with a supercontig N50 of 5.0 kb, covering 90% of 429 kb from previously published genomic contigs. Mixed-stage poly(A)(+) cDNA gave 47.3 million mappable 75-mers (including 5.1 million spliced reads), which separately assembled into 17.8 Mb of cDNA, with an N50 of 1.06 kb. By further scaffolding our genomic supercontigs with cDNA, we increased their N50 to 9.4 kb, nearly double the average gene size in C. elegans. We predicted 22,851 protein-coding genes, and detected expression in 78% of them. Multigenome alignment and data filtering identified 2672 DNA elements conserved between PS1010 and C. elegans that are likely to encode regulatory sequences or previously unknown ncRNAs. Genomic and cDNA sequencing followed by joint assembly is a rapid and useful strategy for biological analysis.
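The abstract reports assembly quality as N50 values (5.0 kb for genomic supercontigs, 1.06 kb for the cDNA assembly, 9.4 kb after cDNA scaffolding). As a reminder of what that statistic measures, here is a minimal sketch of the standard N50 definition: the largest length L such that sequences of length L or longer contain at least half of all assembled bases. The function name `n50` and the toy lengths are illustrative assumptions; they are not taken from the paper, and this is not the authors' pipeline.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths.

    Standard definition: sort lengths from longest to shortest and
    accumulate them; N50 is the length of the sequence at which the
    running total first reaches half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")

    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length

    # Unreachable for non-empty input, kept for completeness.
    return ordered[-1]


# Toy example (hypothetical lengths, not data from the paper).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, joining supercontigs with cDNA evidence raises it without adding any new sequence, which is consistent with the increase from 5.0 kb to 9.4 kb described above.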