Bioinformatics Program, Boston University, 44 Cummington Mall, Boston, MA, 02215, USA.
Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, 368 Plantation Street, Worcester, MA, 01605, USA.
BMC Genomics. 2018 Jul 13;19(1):531. doi: 10.1186/s12864-018-4933-1.
RNA-seq and small RNA-seq are powerful, quantitative tools for studying gene regulation and function. Common high-throughput sequencing methods rely on polymerase chain reaction (PCR) to amplify the starting material, but not every molecule amplifies equally, so some are overrepresented. Unique molecular identifiers (UMIs) can be used to distinguish undesirable PCR duplicates derived from a single molecule from identical but biologically meaningful reads derived from different molecules.
We have incorporated UMIs into RNA-seq and small RNA-seq protocols and developed tools to analyze the resulting data. Our UMIs contain stretches of random nucleotides that are long enough to capture the diversity of molecular species in both RNA-seq and small RNA-seq libraries generated from mouse testis. Our approach yields high-quality data while allowing every molecule in high-depth libraries to be tagged uniquely.
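To make the link between UMI length and tagging capacity concrete, here is a minimal sketch. It assumes UMIs are drawn uniformly at random from the four nucleotides. The abstract does not state the UMI length, so `n_random` and the function names are hypothetical and chosen only for illustration.

```python
def umi_capacity(n_random: int) -> int:
    """Number of distinct UMIs available with n_random random nucleotides (A, C, G, T)."""
    return 4 ** n_random


def expected_collision_fraction(n_molecules: int, n_random: int) -> float:
    """Probability that a given molecule shares its UMI with at least one of the
    other n_molecules - 1 molecules that are otherwise identical (same sequence
    and mapping position), assuming uniformly random UMIs (an idealization)."""
    capacity = umi_capacity(n_random)
    return 1.0 - (1.0 - 1.0 / capacity) ** (n_molecules - 1)


if __name__ == "__main__":
    # Hypothetical lengths; the real protocol's UMI length is not given here.
    for n in (5, 8, 10):
        frac = expected_collision_fraction(1000, n)
        print(f"n_random={n}: capacity={umi_capacity(n)}, collision fraction={frac:.6f}")
```

Under this idealized model, longer random stretches sharply reduce the chance that two distinct but otherwise identical molecules receive the same UMI, which is the property a high-depth library depends on.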
Using simulated and real datasets, we demonstrate that our methods increase the reproducibility of RNA-seq and small RNA-seq data. Notably, we find that the amount of starting material and sequencing depth, but not the number of PCR cycles, determine PCR duplicate frequency. Finally, we show that computational removal of PCR duplicates based only on their mapping coordinates introduces substantial bias into data analysis.
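The difference between coordinate-only and UMI-aware duplicate removal can be illustrated with a small sketch. This is not the authors' tool: the `Read` record, the function names, and the toy reads are hypothetical, and UMIs are compared by exact match with no tolerance for sequencing errors.

```python
from collections import namedtuple

# Hypothetical minimal read record: mapping coordinates plus the UMI sequence.
Read = namedtuple("Read", ["chrom", "pos", "strand", "umi"])


def count_after_coordinate_dedup(reads):
    """Count reads kept when duplicates are defined by mapping coordinates only."""
    return len({(r.chrom, r.pos, r.strand) for r in reads})


def count_after_umi_dedup(reads):
    """Count reads kept when duplicates must share coordinates and UMI."""
    return len({(r.chrom, r.pos, r.strand, r.umi) for r in reads})


# Toy data: three distinct molecules map to chr1:100(+); one was amplified twice.
reads = [
    Read("chr1", 100, "+", "ACGTA"),
    Read("chr1", 100, "+", "ACGTA"),  # PCR duplicate of the read above
    Read("chr1", 100, "+", "TTGCA"),  # different molecule, same coordinates
    Read("chr1", 100, "+", "GGATC"),  # different molecule, same coordinates
    Read("chr2", 500, "-", "ACGTA"),  # same UMI, different locus
]

coordinate_count = count_after_coordinate_dedup(reads)
umi_count = count_after_umi_dedup(reads)
print(coordinate_count, umi_count)
```

In this toy case, coordinate-only removal collapses the three distinct molecules at chr1:100 into one, undercounting an abundant species, while the UMI-aware version removes only the true PCR duplicate.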