Correction of sequence-based artifacts in serial analysis of gene expression.

Author information

Akmaev Viatcheslav R, Wang Clarence J

Affiliation

Genzyme Corporation, Framingham, MA 01701-9322, USA.

Publication information

Bioinformatics. 2004 May 22;20(8):1254-63. doi: 10.1093/bioinformatics/bth077. Epub 2004 Feb 10.

DOI:10.1093/bioinformatics/bth077

PMID:14871862

Abstract

MOTIVATION

Serial Analysis of Gene Expression (SAGE) is a powerful technology for measuring global gene expression through rapid generation of large numbers of transcript tags. Beyond their intrinsic value in differential gene expression analysis, SAGE tag collections afford abundant information on the size and shape of the sample transcriptome and can accelerate novel gene discovery. These latter SAGE applications are facilitated by the enhanced method of Long SAGE. A characteristic of sequencing-based methods, such as SAGE and Long SAGE, is the unavoidable occurrence of artifact sequences resulting from sequencing errors. By virtue of their low, random incidence, such tag errors have minimal impact on differential expression analysis. However, to fully exploit the value of large SAGE tag datasets, it is desirable to account for and correct tag artifacts.

RESULTS

We present estimates for the occurrence of tag errors and an efficient error correction algorithm. Error rate estimates are based on a stochastic model that includes the polymerase chain reaction (PCR) and sequencing error contributions. The correction algorithm, SAGEScreen, is a multi-step procedure that addresses ditag processing, estimation of empirical error rates from highly abundant tags, grouping of similar-sequence tags and statistical testing of observed counts. We apply SAGEScreen to Long SAGE libraries and compare error rates for several processing scenarios. Results with simulated tag collections indicate that SAGEScreen corrects 78% of recoverable tag errors and reduces the occurrence of singleton tags.
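
The grouping and count-testing steps named above can be illustrated with a small sketch. This is not the SAGEScreen implementation, which the abstract does not specify. It assumes independent per-base errors, considers only single-base substitutions, spreads errors uniformly over the three alternative bases, and uses a Poisson tail test. The function names, the `alpha` threshold and the error rate supplied by the caller are all hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def tag_error_probability(per_base_error, tag_length):
    """Chance that a tag carries at least one base error.

    Assumes errors occur independently at each position (an assumption of
    this sketch, not a statement of the paper's stochastic model).
    """
    return 1.0 - (1.0 - per_base_error) ** tag_length


def single_substitution_neighbors(tag):
    """Yield every sequence that differs from `tag` at exactly one position."""
    for i, base in enumerate(tag):
        for alt in BASES:
            if alt != base:
                yield tag[:i] + alt + tag[i + 1:]


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def screen_tags(counts, per_base_error, alpha=0.05):
    """Merge low-count tags into an abundant neighbor when their count is
    statistically consistent with being an error-derived copy of it."""
    corrected = Counter(counts)
    # Visit tags from most to least abundant so that likely "parent" tags
    # absorb their neighbors before those neighbors are considered as parents.
    for parent in sorted(counts, key=lambda t: (-counts[t], t)):
        if parent not in corrected:
            continue  # already merged into a more abundant tag
        parent_count = counts[parent]
        # Expected error-derived copies at one specific neighbor sequence:
        # one position, one of three alternative bases (hypothetical model).
        lam = parent_count * per_base_error / 3.0
        for neighbor in single_substitution_neighbors(parent):
            observed = corrected.get(neighbor, 0)
            if observed == 0 or observed >= parent_count:
                continue
            # A large p-value means errors alone can explain the count.
            if poisson_tail(observed, lam) > alpha:
                corrected[parent] += observed
                del corrected[neighbor]
    return dict(corrected)


if __name__ == "__main__":
    toy_counts = {
        "ACGTACGTAC": 1000,  # abundant tag
        "ACGTACGTAA": 2,     # one mismatch, low count
        "ACGTACGTAG": 50,    # one mismatch, too abundant to be an error
        "TTTTTTTTTT": 1,     # unrelated singleton
    }
    print(screen_tags(toy_counts, per_base_error=0.01))
```

In the toy input, the two-count neighbor is folded into the abundant tag, while the 50-count neighbor is kept because its count is far above what errors alone would produce. The unrelated singleton is untouched, since it has no abundant neighbor to be assigned to.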

AVAILABILITY

The SAGEScreen software is available for academic users from the first author.
