Substantial biases in ultra-short read data sets from high-throughput DNA sequencing.

Authors

Dohm Juliane C, Lottaz Claudio, Borodina Tatiana, Himmelbauer Heinz

Affiliation

Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany.

Publication information

Nucleic Acids Res. 2008 Sep;36(16):e105. doi: 10.1093/nar/gkn425. Epub 2008 Jul 26.

Abstract

Novel sequencing technologies permit the rapid production of large sequence data sets. These technologies are likely to revolutionize genetics and biomedical research, but a thorough characterization of the ultra-short read output is necessary. We generated and analyzed two Illumina 1G ultra-short read data sets, i.e. 2.8 million 27mer reads from a Beta vulgaris genomic clone and 12.3 million 36mers from the Helicobacter acinonychis genome. We found that error rates range from 0.3% at the beginning of reads to 3.8% at the end of reads. Wrong base calls are frequently preceded by base G. Base substitution error frequencies vary by 10- to 11-fold, with A > C transversion being among the most frequent and C > G transversions among the least frequent substitution errors. Insertions and deletions of single bases occur at very low rates. When simulating re-sequencing we found a 20-fold sequencing coverage to be sufficient to compensate errors by correct reads. The read coverage of the sequenced regions is biased; the highest read density was found in intervals with elevated GC content. High Solexa quality scores are over-optimistic and low scores underestimate the data quality. Our results show different types of biases and ways to detect them. Such biases have implications on the use and interpretation of Solexa data, for de novo sequencing, re-sequencing, the identification of single nucleotide polymorphisms and DNA methylation sites, as well as for transcriptome analysis.
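Two of the checks the abstract describes, error rate as a function of read position and the comparison of quality scores against observed errors, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names and the toy reads are hypothetical, the reads are assumed to be already aligned to the reference without gaps, and the score conversion uses the standard odds-based definition of Solexa quality, Q = -10 log10(p / (1 - p)).

```python
from collections import Counter


def solexa_to_error_prob(q):
    """Error probability predicted by a Solexa-scaled quality score.

    Solexa scores are defined on the odds of an error:
    Q = -10 * log10(p / (1 - p)), hence p = 1 / (1 + 10 ** (Q / 10)).
    """
    return 1.0 / (1.0 + 10.0 ** (q / 10.0))


def per_position_error_rate(pairs):
    """Fraction of mismatching base calls at each read position.

    `pairs` is an iterable of (read, reference) strings of equal length,
    assumed to be ungapped alignments (a simplification for illustration).
    """
    mismatches = Counter()
    totals = Counter()
    for read, ref in pairs:
        for i, (called, true_base) in enumerate(zip(read, ref)):
            totals[i] += 1
            if called != true_base:
                mismatches[i] += 1
    return [mismatches[i] / totals[i] for i in range(len(totals))]


# Toy data (hypothetical): four 4-base reads against the same reference.
toy_pairs = [
    ("ACGT", "ACGT"),
    ("ACGA", "ACGT"),
    ("ACTT", "ACGT"),
    ("ACGC", "ACGT"),
]
rates = per_position_error_rate(toy_pairs)
print(rates)
print(solexa_to_error_prob(20))
```

Grouping base calls by their reported score and comparing the observed mismatch fraction in each group with `solexa_to_error_prob` is one way to see whether scores are over-optimistic (observed rate above the predicted one) or pessimistic (observed rate below it).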

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1ff4/2532726/d98991721a85/gkn425f1.jpg
