Jiao Xiaoli, Zheng Xin, Ma Liang, Kutty Geetha, Gogineni Emile, Sun Qiang, Sherman Brad T, Hu Xiaojun, Jones Kristine, Raley Castle, Tran Bao, Munroe David J, Stephens Robert, Liang Dun, Imamichi Tomozumi, Kovacs Joseph A, Lempicki Richard A, Huang Da Wei
Laboratory of Immunopathogenesis and Bioinformatics, SAIC-Frederick, Inc., Frederick National Laboratory, MD 21702, USA.
J Data Mining Genomics Proteomics. 2013 Jul 31;4(3). doi: 10.4172/2153-0602.1000136.
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20 kb), in contrast to the shorter reads produced by first- and second-generation sequencing technologies. Because the platform is new, it is important to assess its sequencing error rate, as well as the quality control (QC) parameters associated with PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons was sequenced on the PacBio RS platform. After aligning the Circular Consensus Sequence (CCS) reads from this experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC and improved to 1.3% with an SVM-based multi-parameter QC method. In addition, a de novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that, even though CCS reads are already error-corrected, it is still necessary to perform appropriate QC on them in order to obtain successful downstream bioinformatics results.
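As a rough illustration of the kind of median error rate reported above, the sketch below computes a per-read error rate from alignment counts and takes the median across reads. The abstract does not give the exact error definition, so the formula used here (mismatches plus insertions plus deletions, divided by the number of aligned reference bases) is an assumption, and the function names and input layout are hypothetical.

```python
from statistics import median


def read_error_rate(mismatches, insertions, deletions, aligned_ref_bases):
    """Per-read error rate under an ASSUMED definition (not taken from the study):
    (mismatches + insertions + deletions) / aligned reference bases."""
    if aligned_ref_bases <= 0:
        raise ValueError("aligned_ref_bases must be positive")
    return (mismatches + insertions + deletions) / aligned_ref_bases


def median_error_rate(reads):
    """Median of the per-read error rates.

    `reads` is an iterable of hypothetical tuples:
    (mismatches, insertions, deletions, aligned_ref_bases), one per aligned read.
    """
    return median(read_error_rate(*r) for r in reads)


if __name__ == "__main__":
    # Made-up counts for three reads, for illustration only.
    example_reads = [(2, 1, 0, 100), (0, 0, 0, 100), (5, 3, 2, 200)]
    print(f"median error rate: {median_error_rate(example_reads):.2%}")
```

Running the same calculation on all CCS reads and then on the subset that passes a QC filter would give the kind of before-and-after comparison described above.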