Fast and accurate mapping of Complete Genomics reads.

Author information

Lee Donghyuk, Hormozdiari Farhad, Xin Hongyi, Hach Faraz, Mutlu Onur, Alkan Can

Affiliations

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA.

Department of Computer Science, University of California Los Angeles, Los Angeles, CA, USA.

Publication information

Methods. 2015 Jun;79-80:3-10. doi: 10.1016/j.ymeth.2014.10.012. Epub 2014 Oct 22.

DOI:10.1016/j.ymeth.2014.10.012

PMID:25461772

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4406782/

Abstract

Many recent advances in genomics, and the expectations of personalized medicine, are made possible by the power of high-throughput sequencing (HTS) to sequence large collections of human genomes. Tens of different sequencing technologies are currently available, and each HTS platform has different strengths and biases. This diversity makes it possible to use different technologies to correct for one another's shortcomings, but it also requires developing different algorithms for each platform because of differences in data types and error models. The first problem to tackle in analyzing HTS data for resequencing applications is read mapping. Many tools have been developed for the most popular HTS methods, but publicly available, open-source aligners are still lacking for the Complete Genomics (CG) platform. Unfortunately, Burrows-Wheeler based methods are not practical for CG data because of the gapped nature of the reads generated by this platform. Here we provide a sensitive read mapper (sirFAST) for the CG technology, based on the seed-and-extend paradigm, that can quickly map CG reads to a reference genome. We evaluate the performance and accuracy of sirFAST using both simulated and publicly available real data sets, showing high precision and recall rates.
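
For readers unfamiliar with the seed-and-extend paradigm named in the abstract, the general idea can be sketched as follows. This is a minimal illustration only and is not sirFAST's implementation: it assumes contiguous (ungapped) reads, uses a plain hash index of fixed-length k-mers, and verifies candidates by Hamming distance. The function names (`build_index`, `hamming`, `map_read`) and the parameters `k` and `max_mismatches` are hypothetical and chosen for this example; handling the gapped structure of CG reads would require additional logic not shown here.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Count mismatching characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k, max_mismatches):
    """Return sorted (start, mismatches) pairs for locations that pass verification.

    Assumption: reads are contiguous and only substitutions are allowed.
    """
    hits = {}
    # Seed step: look up each non-overlapping k-mer of the read in the index.
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset  # implied start of the whole read
            if start < 0 or start + len(read) > len(reference):
                continue
            if start in hits:
                continue  # this candidate location was already verified
            # Extend step: compare the full read against the candidate location.
            mismatches = hamming(read, reference[start:start + len(read)])
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTTAACG"
    read = "TAGCCTATAGGC"  # reference[8:20] with one substitution
    index = build_index(reference, k=4)
    print(map_read(read, reference, index, k=4, max_mismatches=2))  # [(8, 1)]
```

The seed lookup narrows the search to a few candidate locations, and the more expensive comparison runs only there. Because any error-free seed is enough to find the true location, a read with a few mismatches can still be mapped.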
