Wang Chen, Davila Jaime I, Baheti Saurabh, Bhagwate Aditya V, Wang Xue, Kocher Jean-Pierre A, Slager Susan L, Feldman Andrew L, Novak Anne J, Cerhan James R, Thompson E Aubrey, Asmann Yan W
Division of Biomedical Statistics and Informatics, Mayo Clinic, 200 First Street SW, Rochester MN 55905, Department of Health Sciences Research, Mayo Clinic, 4500 San Pablo Road South, Jacksonville FL 32224, Department of Laboratory Medicine and Pathology, Division of Hematology, Department of Internal Medicine, Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester MN 55905 and Department of Cancer Biology, Mayo Clinic, 4500 San Pablo Road South, Jacksonville FL 32224, USA.
Bioinformatics. 2014 Dec 1;30(23):3414-6. doi: 10.1093/bioinformatics/btu577. Epub 2014 Aug 27.
RNA-seq has become the method of choice for quantifying genes and exons, discovering novel transcripts and detecting fusion genes. However, reliable variant identification from RNA-seq data remains challenging because of the complexity of the transcriptome, the difficulty of accurately mapping reads that span exon boundaries and the bias introduced during sequencing library preparation.
We developed RVboost, a novel method designed specifically for RNA variant prioritization. RVboost uses several attributes unique to RNA library preparation, sequencing and RNA-seq data analysis. It uses a boosting method to train a model of 'good quality' variants on common variants from HapMap, then prioritizes and calls RNA variants based on the trained model. We packaged RVboost in a comprehensive workflow that integrates tools for variant calling, annotation and filtering.
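The training idea described above can be illustrated with a minimal sketch. The abstract says only that "a boosting method" is used, so AdaBoost with one-feature decision stumps is an assumption made here for illustration and is not presented as RVboost's actual implementation. The two feature columns, the toy values, and the labeling scheme (+1 for calls that overlap HapMap, -1 for the rest) are likewise hypothetical.

```python
import math

def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) for the best one-feature rule."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # The rule predicts `sign` when feature j >= thr, otherwise -sign.
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[j] >= thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign

def train_adaboost(X, y, rounds=5):
    """Train a weighted ensemble of stumps; y holds +1/-1 labels."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        # Clip the error so the log below stays finite on separable data.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Up-weight misclassified examples, down-weight correct ones, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def score(model, row):
    """Higher scores mean the call looks more like the 'good quality' training set."""
    return sum(alpha * stump_predict(stump, row) for alpha, stump in model)

# Hypothetical toy features: [distance to exon boundary, fraction of
# supporting reads with the variant in the first six bases].
X = [[50, 0.00], [120, 0.10], [80, 0.05],   # assumed HapMap-overlapping calls
     [2, 0.90], [1, 0.80], [3, 0.70]]       # assumed remaining calls
y = [1, 1, 1, -1, -1, -1]

model = train_adaboost(X, y)
candidates = [[100, 0.0], [2, 0.85]]
ranked = sorted(candidates, key=lambda row: score(model, row), reverse=True)
```

Sorting candidate calls by `score` gives a prioritized list, which is the sense in which a trained model can both rank and call variants once a score cutoff is chosen.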
RVboost consistently outperforms variant quality score recalibration (VQSR) from the Genome Analysis Toolkit (GATK) and the RNA-seq variant-calling pipeline SNPiR across 12 RNA-seq samples, using variants from paired exome sequencing data as ground truth. Several RNA-seq-specific attributes proved critical for distinguishing true from false variants, including the distance from the variant position to exon boundaries and the percentage of variant-supporting reads in which the variant falls within the first six bases. The latter identifies false variants introduced by random hexamer priming during library construction.
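The two attributes named above could be computed along the following lines. This is a rough sketch under stated assumptions rather than RVboost's own code: exons are assumed to be given as hypothetical `(start, end)` coordinate pairs, and each supporting read is assumed to be reduced to the 0-based offset of the variant from the read's 5' end. Strand, soft-clipping and other alignment details are ignored.

```python
def distance_to_exon_boundary(pos, exons):
    """Smallest distance from `pos` to any exon start or end coordinate."""
    return min(min(abs(pos - start), abs(pos - end)) for start, end in exons)

def first_six_fraction(variant_offsets, window=6):
    """Fraction of variant-supporting reads whose variant base lies in the
    first `window` bases of the read (offsets are 0-based from the 5' end)."""
    if not variant_offsets:
        return 0.0
    hits = sum(1 for offset in variant_offsets if offset < window)
    return hits / len(variant_offsets)

# Hypothetical example: a variant 5 bases inside an exon, supported by
# five reads, three of which carry it within their first six bases.
exons = [(1000, 1200), (1500, 1700)]
boundary_distance = distance_to_exon_boundary(1005, exons)
hexamer_fraction = first_six_fraction([0, 2, 5, 40, 60])
```

A call with a small boundary distance or a high first-six fraction is the kind that these attributes would mark as suspect, since both patterns point to mapping or priming artifacts rather than a real variant.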
The RVboost package runs on Mac and Linux. The software and user manual are available at http://bioinformaticstools.mayo.edu/research/rvboost/.