Liu Donglin, Graber Joel H
The Jackson Laboratory, Bar Harbor, ME 04609, USA.
BMC Bioinformatics. 2006 Feb 17;7:77. doi: 10.1186/1471-2105-7-77.
Publicly accessible EST libraries contain valuable information that can be utilized for studies of tissue-specific gene expression and processing of individual genes. This information is, however, confounded by multiple systematic effects arising from the procedures used to generate these libraries.
We aligned ESTs against a reference set of transcripts to estimate the size distributions of the cDNA inserts and of the sampled mRNA transcripts in individual EST libraries, and we show how these measurements can inform quantitative comparisons of libraries. While significant attention has been paid to the effects of normalization and subtraction, we also find significant biases in transcript sampling introduced by the combined procedures of reverse transcription and selection of cDNA clones for sequencing. Using examples drawn from studies of mRNA 3'-processing (cleavage and polyadenylation), we demonstrate the effects of this transcript sampling bias and provide a method for identifying libraries that can be compared safely. All data sets, supplemental data, and software are available at our supplemental web site.
The biases we characterize in the transcript sampling of EST libraries represent a significant and heretofore under-appreciated source of false positive candidates for tissue-, cell type-, or developmental stage-specific activity or processing of genes. Without correction, quantitative comparison of dissimilar EST libraries will likely identify changes that are statistically significant but biologically meaningless.
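The idea of checking whether two libraries sample transcripts of similar size before comparing them can be illustrated with a minimal sketch. This is not the authors' software: the input format (one `(library_id, transcript_length)` pair per aligned EST), the use of a two-sample Kolmogorov-Smirnov statistic, and the `max_d` cutoff are all assumptions made here for illustration only.

```python
from bisect import bisect_right
from collections import defaultdict


def lengths_by_library(alignments):
    """Group sampled transcript lengths by library.

    `alignments` is an iterable of (library_id, transcript_length) pairs,
    one per aligned EST. This input format is hypothetical.
    """
    grouped = defaultdict(list)
    for library_id, transcript_length in alignments:
        grouped[library_id].append(transcript_length)
    # Sort each sample so empirical CDFs can be evaluated by bisection.
    return {lib: sorted(vals) for lib, vals in grouped.items()}


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic for two sorted, non-empty samples.

    Returns the largest absolute difference between the two empirical CDFs.
    """
    d = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def comparable(a, b, max_d=0.1):
    """Flag two libraries as comparable when their sampled transcript
    length distributions are close. The cutoff `max_d` is an arbitrary,
    illustrative value, not one taken from the paper.
    """
    return ks_statistic(a, b) <= max_d


# Toy data: libraries A and B sample the same transcript lengths,
# library C samples only short transcripts.
toy = [
    ("A", 1000), ("A", 2000), ("A", 3000),
    ("B", 1000), ("B", 2000), ("B", 3000),
    ("C", 500), ("C", 600), ("C", 700),
]
libs = lengths_by_library(toy)
print(comparable(libs["A"], libs["B"]))
print(comparable(libs["A"], libs["C"]))
```

In this toy data, libraries A and B have identical length distributions while C is shifted toward short transcripts, so a difference in EST counts between A and C could reflect sampling rather than biology.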