Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence.

Affiliation

Department of Plant Sciences, University of California, Davis, CA 95616, USA.

Publication information

BMC Genomics. 2011 Jan 25;12:59. doi: 10.1186/1471-2164-12-59.

Abstract

BACKGROUND

Many plants have large and complex genomes with an abundance of repeated sequences. Many plants are also polyploid. Both of these attributes typify the genome architecture in the tribe Triticeae, whose members include economically important wheat, rye and barley. Large genome sizes, an abundance of repeated sequences, and polyploidy present challenges to genome-wide SNP discovery using next-generation sequencing (NGS) of total genomic DNA by making alignment and clustering of short reads generated by the NGS platforms difficult, particularly in the absence of a reference genome sequence.

RESULTS

An annotation-based, genome-wide SNP discovery pipeline is reported using NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one genotype are annotated in order to distinguish single-copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes. Multiple genome equivalents of shotgun reads of another genotype generated with SOLiD or Solexa are then mapped to the annotated Roche 454 reads to identify putative SNPs. A pipeline program package, AGSNP, was developed and used for genome-wide SNP discovery in Aegilops tauschii, the diploid source of the wheat D genome, which has a genome size of 4.02 Gb, 90% of which consists of repetitive sequences. Genomic DNA of Ae. tauschii accession AL8/78 was sequenced with the Roche 454 NGS platform. Genomic DNA and cDNA of Ae. tauschii accession AS75 were sequenced primarily with SOLiD, although some Solexa and Roche 454 genomic sequences were also generated. A total of 195,631 putative SNPs were discovered in gene sequences, 155,580 putative SNPs were discovered in uncharacterized single-copy regions, and another 145,907 putative SNPs were discovered in repeat junctions. These SNPs were dispersed across the entire Ae. tauschii genome. To assess the false positive SNP discovery rate, DNA containing putative SNPs was amplified by PCR from AL8/78 and AS75 and resequenced with the ABI 3730xl. In a sample of 302 randomly selected putative SNPs, 84.0% in gene regions, 88.0% in repeat junctions, and 81.3% in uncharacterized regions were validated.
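The mapping step described above can be illustrated with a toy sketch. This is not the AGSNP implementation: the function name `call_putative_snps`, the thresholds `min_depth` and `min_fraction`, and the input layout (ungapped placements given as offset and sequence) are all assumptions made for illustration. An annotated long read stands in for a Roche 454 read from AL8/78, and short reads already placed on it stand in for SOLiD or Solexa reads from AS75; positions where the short reads consistently disagree with the long read are reported as putative SNPs.

```python
from collections import Counter


def call_putative_snps(reference_read, mapped_reads, min_depth=3, min_fraction=0.8):
    """Return (position, ref_base, alt_base, depth) tuples for candidate SNPs.

    reference_read: the annotated long read, as a string of bases.
    mapped_reads:   list of (offset, sequence) pairs; each short read is
                    assumed to align without gaps starting at `offset`.
    min_depth and min_fraction are hypothetical thresholds, not AGSNP defaults.
    """
    # Count the bases that the short reads place at each reference position.
    pileup = [Counter() for _ in reference_read]
    for offset, seq in mapped_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(reference_read):
                pileup[pos][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to support a call at this position
        top_base, top_count = counts.most_common(1)[0]
        # Call a SNP only when the dominant base differs from the reference
        # and is supported by a large enough share of the reads.
        if top_base != reference_read[pos] and top_count / depth >= min_fraction:
            snps.append((pos, reference_read[pos], top_base, depth))
    return snps


reference = "ACGTACGTAC"
reads = [(0, "ACGTTCG"), (2, "GTTCGTA"), (3, "TTCGTAC")]
print(call_putative_snps(reference, reads))  # [(4, 'A', 'T', 3)]
```

In this sketch, restricting `reference_read` to reads annotated as single-copy sequence or repeat junctions would play the role of the annotation filter, since mismatches in repetitive or paralogous sequence are more likely to reflect mis-mapping than true polymorphism.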

CONCLUSION

An annotation-based genome-wide SNP discovery pipeline for NGS platforms was developed. The pipeline is suitable for SNP discovery in genomic libraries of complex genomes and does not require a reference genome sequence. The pipeline is applicable to all current NGS platforms, provided that at least one such platform generates relatively long reads. The pipeline package, AGSNP, and the discovered 497,118 Ae. tauschii SNPs can be accessed at (http://avena.pw.usda.gov/wheatD/agsnp.shtml).
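As a quick consistency check, the three per-category counts given in the Results can be summed and compared with the total reported here. The dictionary keys below are descriptive labels chosen for this sketch, not identifiers from AGSNP.

```python
# Putative SNP counts per annotation category, as reported in the Results.
snp_counts = {
    "gene sequences": 195_631,
    "uncharacterized single-copy regions": 155_580,
    "repeat junctions": 145_907,
}

total = sum(snp_counts.values())
print(total)  # 497118

# Share of the total contributed by each category.
for category, count in snp_counts.items():
    print(f"{category}: {count / total:.1%}")
```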


