BGI-Shenzhen, Bei Shan Industrial Zone, Yantian District, Shenzhen 518083, China.
Plant J. 2012 Nov;72(3):461-73. doi: 10.1111/j.1365-313X.2012.05093.x. Epub 2012 Aug 14.
Flax (Linum usitatissimum) is an ancient crop that is widely cultivated as a source of fiber, oil and medicinally relevant compounds. To accelerate crop improvement, we performed whole-genome shotgun sequencing of the nuclear genome of flax. Seven paired-end libraries ranging in size from 300 bp to 10 kb were sequenced using an Illumina Genome Analyzer. A de novo assembly, composed exclusively of deep-coverage (approximately 94× raw, approximately 69× filtered) short-sequence reads (44-100 bp), produced a set of scaffolds with N50 = 694 kb, including contigs with N50 = 20.1 kb. The contig assembly contained 302 Mb of non-redundant sequence, representing an estimated 81% genome coverage. Up to 96% of published flax ESTs aligned to the whole-genome shotgun scaffolds. However, comparisons with independently sequenced BACs and fosmids showed some mis-assembly of regions at the genome scale. A total of 43,384 protein-coding genes were predicted in the whole-genome shotgun assembly, and up to 93% of published flax ESTs and 86% of Arabidopsis thaliana genes aligned to these predicted genes, indicating excellent coverage and accuracy at the gene level. Analysis of the synonymous substitution rates (Ks) observed within duplicate gene pairs was consistent with a recent (5-9 MYA) whole-genome duplication in flax. Within the predicted proteome, we observed enrichment of many conserved domains (Pfam-A) that may contribute to the unique properties of this crop, including agglutinin proteins. Together, these results show that de novo assembly based solely on whole-genome shotgun short-sequence reads is an efficient means of obtaining nearly complete genome sequence information for some plant species.
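The scaffold and contig N50 values quoted above are a standard assembly contiguity statistic: the length L such that sequences of length L or longer together make up at least half of the total assembled length. The sketch below illustrates that definition only. The function name `n50` and the toy lengths are assumptions made for illustration; they are not the authors' pipeline or data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half the total length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy lengths (not flax data): total is 32, so half is 16.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

Because N50 depends only on the distribution of sequence lengths, it describes contiguity and not correctness, which is why the abstract reports comparisons against BACs and fosmids separately.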