Fungtammasan Arkarachai, Tomaszkiewicz Marta, Campos-Sánchez Rebeca, Eckert Kristin A, DeGiorgio Michael, Makova Kateryna D
Integrative Biosciences, Bioinformatics and Genomics Option, Pennsylvania State University; Department of Biology, Pennsylvania State University; Center for Medical Genomics, Pennsylvania State University; Huck Institute of Genome Sciences, Pennsylvania State University.
Department of Biology, Pennsylvania State University; Center for Medical Genomics, Pennsylvania State University.
Mol Biol Evol. 2016 Oct;33(10):2744-58. doi: 10.1093/molbev/msw139. Epub 2016 Jul 12.
Transcript variation has important implications for organismal function in health and disease. Most transcriptome studies focus on assessing variation in gene expression levels and isoform representation. Variation at the level of transcript sequence is caused by RNA editing and transcription errors, and leads to nongenetically encoded transcript variants, or RNA-DNA differences (RDDs). Such variation has been understudied, in part because its detection is obscured by reverse transcription (RT) and sequencing errors. To date, it has been evaluated only for intertranscript base substitution differences. Here, we investigated transcript sequence variation for short tandem repeats (STRs). We developed the first maximum-likelihood estimator (MLE) to infer RT error and RDD rates, taking next-generation sequencing error rates into account. Using the MLE, we empirically evaluated RT error and RDD rates for STRs in a large-scale DNA and RNA replicated sequencing experiment conducted in a primate species. The RT error rates increased exponentially with STR length and were biased toward expansions. The RDD rates were approximately one order of magnitude lower than the RT error rates. The RT error rates estimated with the MLE from a primate data set were concordant with those estimated with an independent method, barcoded RNA sequencing, from a Caenorhabditis elegans data set. Our results have important implications for medical genomics, as STR allelic variation is associated with >40 diseases. STR nonallelic transcript variation can also contribute to disease phenotype. The MLE and empirical rates presented here can be used to evaluate the probability of disease-associated transcripts arising due to RDD.