Reference-free validation of short read data.

Affiliation

Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria, Australia.

Publication information

PLoS One. 2010 Sep 22;5(9):e12681. doi: 10.1371/journal.pone.0012681.

Abstract

BACKGROUND

High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties of some of these biases are not always known. Accurate assessment of bias is required for experimental quality control, genome assembly, and interpretation of coverage results. An additional challenge is that, for new genomes or material from an unidentified source, there may be no reference available against which the reads can be checked.

RESULTS

We propose analytical methods for identifying biases in a collection of short reads, without recourse to a reference. These, in conjunction with existing approaches, comprise a methodology that can be used to quantify the quality of a set of reads. Our methods use three different measures: analysis of base calls; analysis of k-mers; and analysis of distributions of k-mers. We apply our methodology to a wide range of short read data and show that, surprisingly, strong biases appear to be present. These include gross overrepresentation of some poly-base sequences, per-position biases towards some bases, and apparent preferences for some starting positions over others.
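
As a rough illustration of the kind of reference-free statistics described above, the following Python sketch computes per-position base fractions and overlapping k-mer counts for a set of reads, and reports what share of k-mer occurrences are poly-base sequences. It is not the authors' implementation: the function names, the choice of k, and the toy reads are all assumptions made for this example.

```python
from collections import Counter


def per_position_base_fractions(reads):
    """Return one dict per read position, mapping base -> fraction of reads."""
    if not reads:
        return []
    length = max(len(r) for r in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for i, base in enumerate(read.upper()):
            counts[i][base] += 1
    fractions = []
    for c in counts:
        total = sum(c.values())
        fractions.append({base: n / total for base, n in c.items()})
    return fractions


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def poly_base_share(counts, k):
    """Fraction of k-mer occurrences made of one repeated base (e.g. AAA)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    homopolymers = sum(counts[b * k] for b in "ACGT")
    return homopolymers / total


# Toy reads, invented for illustration only (not data from the paper).
reads = ["ACGTAC", "AAAAAA", "ACGGTC", "TTGACA"]
fractions = per_position_base_fractions(reads)
counts = kmer_counts(reads, 3)

print(fractions[0])               # base fractions at the first read position
print(counts.most_common(3))      # most frequent 3-mers
print(poly_base_share(counts, 3)) # share of poly-base 3-mers
```

In an unbiased sample from a typical genome, per-position fractions would be expected to stay roughly flat along the read and poly-base k-mers to stay rare, so large deviations in either statistic are a possible sign of the biases the abstract describes. What counts as "large" depends on the organism and the sequencing protocol, and this sketch does not attempt to set a threshold.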

CONCLUSIONS

The existence of biases in short read data is known, but they appear to be greater and more diverse than identified in previous literature. Statistical analysis of a set of short reads can help identify issues prior to assembly or resequencing, and should help guide chemical or statistical methods for bias rectification.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/eea4/2943903/38b550762d38/pone.0012681.g001.jpg
