Discarding duplicate ditags in LongSAGE analysis may introduce significant error.

Author information

Emmersen Jeppe, Heidenblut Anna M, Høgh Annabeth Laursen, Hahn Stephan A, Welinder Karen G, Nielsen Kåre L

Affiliation

Department of Biotechnology, Chemistry and Environmental Engineering, Aalborg University, Aalborg, Denmark.

Publication information

BMC Bioinformatics. 2007 Mar 14;8:92. doi: 10.1186/1471-2105-8-92.

DOI:10.1186/1471-2105-8-92

PMID:17359537

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1839111/

Abstract

BACKGROUND

In gene expression analysis by Serial Analysis of Gene Expression (SAGE), duplicate ditags are routinely removed before data analysis because they are suspected to be artifacts of SAGE library construction. As a consequence, naturally occurring duplicate ditags are also removed, which introduces a measurement error.

RESULTS

An algorithm was developed to analyze the differential occurrence of SAGE tags in different ditag combinations. Analysis of a pancreatic acinar cell LongSAGE library showed no sign of a general amplification bias that would justify the removal of all duplicate ditags. Extending the analysis to 10 additional LongSAGE libraries likewise showed no justification for removing all duplicate ditags. On the contrary, while the error introduced in original SAGE by removing naturally occurring duplicate ditags is insignificant, in LongSAGE it leads to an error of up to 3-fold. However, the algorithm developed for the analysis of duplicate ditags was able to identify individual artifact ditags that originated from rare nucleotide variations of tags and from vector contamination.
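To make the size of this effect concrete, the following is a minimal sketch, not the authors' algorithm, of how tag counts change when every duplicate ditag is discarded. The function names (`tag_counts`, `fold_error`), the toy tag strings, and the simplification that a ditag is an unordered pair of tags are all assumptions made for illustration.

```python
from collections import Counter


def tag_counts(ditags, drop_duplicates):
    """Count how often each tag occurs in a list of (tag_a, tag_b) ditags."""
    # Assumption: a ditag is unordered, so (A, B) and (B, A) are the same ditag.
    normalized = [tuple(sorted(ditag)) for ditag in ditags]
    if drop_duplicates:
        # Conventional filtering: keep one copy of each distinct ditag.
        normalized = list(set(normalized))
    counts = Counter()
    for tag_a, tag_b in normalized:
        counts[tag_a] += 1
        counts[tag_b] += 1
    return counts


def fold_error(ditags):
    """Per-tag ratio of the unfiltered count to the duplicate-filtered count."""
    full = tag_counts(ditags, drop_duplicates=False)
    dedup = tag_counts(ditags, drop_duplicates=True)
    return {tag: full[tag] / dedup[tag] for tag in full}


# Hypothetical toy library: two abundant tags that pair with each other often.
ditags = [("AAA", "BBB")] * 3 + [("AAA", "CCC"), ("BBB", "DDD")]
errors = fold_error(ditags)
```

In this toy library, tags `AAA` and `BBB` are each counted 4 times without filtering but only twice after duplicate removal, so their abundance is underestimated 2-fold, while the rare tags `CCC` and `DDD` are unaffected. If the repeated ditag is a natural pairing of two abundant tags rather than an amplification artifact, that underestimate is a measurement error.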

CONCLUSION

The removal of all duplicate ditags was unfounded for the datasets analyzed and led to large errors. This may also be the case for other LongSAGE datasets already present in databases. Analysis of the ditag population, however, can identify artifact tags that should be removed from analysis or have their tag count adjusted.
