Anisimov Sergey V, Sharov Alexei A
Section for Neuronal Survival, Wallenberg Neuroscience Center, Lund University, 221 84 Lund, Sweden.
BMC Bioinformatics. 2004 Oct 18;5:152. doi: 10.1186/1471-2105-5-152.
Serial Analysis of Gene Expression (SAGE) is a functional genomic technique that quantitatively analyzes the cellular transcriptome. The analysis of SAGE libraries relies on identifying ditags in sequencing files; however, the software used to examine SAGE libraries cannot distinguish authentic ditags from false ones ("quasi-ditags").
We provide examples of quasi-ditags that originate from cloning and sequencing artifacts (i.e., genomic contamination or random combinations of nucleotides) and that end up in SAGE libraries. We employed a mathematical model to predict the frequency of quasi-ditags in random nucleotide sequences, and our data show that clones containing two or fewer ditags (which include chromosomal cloning artifacts) should be excluded from the analysis of SAGE catalogs.
Cloning and sequencing artifacts that contaminate SAGE libraries could be eliminated with a simple pre-screening procedure, increasing the reliability of the data.
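The pre-screening rule described above (exclude clones with two or fewer ditags) can be illustrated with a minimal Python sketch. It is not the authors' software or their mathematical model. The abstract does not name the anchoring-enzyme site or the accepted ditag length, so the `CATG` site and the 20-26 bp window used here are assumptions, and the function names `count_ditag_like`, `prescreen`, and `mean_quasi_ditags` are hypothetical. The last function is a simple Monte Carlo stand-in for estimating how often ditag-like spans arise by chance in random sequence.

```python
import random

ANCHOR = "CATG"            # assumed anchoring-enzyme site (not stated in the abstract)
MIN_GAP, MAX_GAP = 20, 26  # assumed ditag length window between sites, in bp


def count_ditag_like(seq, anchor=ANCHOR, min_gap=MIN_GAP, max_gap=MAX_GAP):
    """Count spans between consecutive anchor sites whose length fits the window."""
    seq = seq.upper()
    sites = []
    pos = seq.find(anchor)
    while pos != -1:
        sites.append(pos)
        pos = seq.find(anchor, pos + 1)
    count = 0
    for left, right in zip(sites, sites[1:]):
        gap = right - (left + len(anchor))  # bases strictly between the two sites
        if min_gap <= gap <= max_gap:
            count += 1
    return count


def prescreen(clones, min_ditags=3):
    """Keep clones with at least min_ditags ditag-like spans (drops clones with <= 2)."""
    return {
        name: seq
        for name, seq in clones.items()
        if count_ditag_like(seq) >= min_ditags
    }


def mean_quasi_ditags(length, trials=1000, seed=0):
    """Estimate the mean number of ditag-like spans in random sequences of a given length."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        total += count_ditag_like(seq)
    return total / trials


if __name__ == "__main__":
    # Synthetic clones for illustration only; these are not data from the paper.
    concatemer = "CATG" + "A" * 22 + "CATG" + "C" * 22 + "CATG" + "G" * 22 + "CATG"
    artifact = "CATG" + "A" * 22 + "CATG" + "T" * 200
    kept = prescreen({"concatemer": concatemer, "artifact": artifact})
    print(sorted(kept))
    print(mean_quasi_ditags(500, trials=200))
```

In this sketch the synthetic `concatemer` clone carries three ditag-like spans and passes the filter, while the `artifact` clone carries only one and is dropped. The threshold is exposed as `min_ditags` so that a different cut-off can be tried against the chance rate estimated by `mean_quasi_ditags`.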