Bioinformatics and Genomics Program, Centre de Regulació Genòmica (CRG), 08003 Barcelona, Spain.
Nucleic Acids Res. 2012 Nov 1;40(20):10073-83. doi: 10.1093/nar/gks666. Epub 2012 Sep 7.
High-throughput sequencing of cDNA libraries constructed from cellular RNA complements (RNA-Seq) naturally provides a digital quantitative measurement for every expressed RNA molecule. The nature, impact and mutual interference of biases in different experimental setups are, however, still poorly understood, mostly due to the lack of data from intermediate protocol steps. We analysed multiple RNA-Seq experiments involving different sample preparation protocols and sequencing platforms. We broke them down into their common, and currently indispensable, technical components (reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing) and investigated how these steps influence the abundance and distribution of the sequenced reads. For each of those steps, we developed universally applicable models, which can be parameterised by empirical attributes of any experimental protocol. Our models are implemented in a computer simulation pipeline called the Flux Simulator, and we show that read distributions generated by different combinations of these models closely reproduce the evidence obtained from the corresponding experimental setups. We further demonstrate that our in silico RNA-Seq provides insights into hidden precursors that determine the final configuration of reads along gene bodies, revealing enhancing or compensatory effects that explain apparently controversial observations. Moreover, our simulations identify hitherto unreported sources of systematic bias from RNA hydrolysis, a fragmentation technique currently employed by most RNA-Seq protocols.
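To make the idea of chaining per-step models concrete, the following is a minimal sketch of an in silico pipeline covering only three of the steps named above: fragmentation, gel segregation (size selection) and sequencing. It is not the Flux Simulator's code or API. All function names and parameters are hypothetical, and the uniformly random breakpoints are an assumption chosen for simplicity rather than one of the models described in the paper. Reverse transcription, adapter ligation and PCR amplification are omitted.

```python
import random


def fragment(length, n_breaks, rng):
    """Break one molecule of `length` nt at uniformly random positions.

    Assumption: uniform breakpoints are a toy stand-in for a real
    fragmentation model. Returns half-open (start, end) intervals.
    """
    n_breaks = min(n_breaks, length - 1)
    cuts = sorted(rng.sample(range(1, length), n_breaks))
    bounds = [0] + cuts + [length]
    return list(zip(bounds, bounds[1:]))


def size_select(fragments, lo, hi):
    """Keep fragments whose length lies in [lo, hi], mimicking gel segregation."""
    return [(s, e) for s, e in fragments if lo <= e - s <= hi]


def sequence(fragments, read_len):
    """Emit one read per fragment, starting at the fragment's 5' end."""
    return [(s, min(s + read_len, e)) for s, e in fragments]


def simulate(transcript_len, copies, n_breaks, lo, hi, read_len, seed=0):
    """Run the toy pipeline on `copies` identical molecules of one transcript."""
    rng = random.Random(seed)
    reads = []
    for _ in range(copies):
        frags = fragment(transcript_len, n_breaks, rng)
        reads.extend(sequence(size_select(frags, lo, hi), read_len))
    return reads
```

Calling, for example, `simulate(2000, 50, 8, 150, 300, 36)` returns a list of read intervals along a hypothetical 2000 nt transcript. Binning the read start positions then shows how the retained size range alone shapes coverage along the transcript, which is the kind of step-wise effect the abstract describes.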