HapCol: accurate and memory-efficient haplotype assembly from long reads.

Author information

Dipartimento di Informatica Sistemistica e Comunicazione (DISCo), Univ. degli Studi di Milano-Bicocca, Milan, Italy.

Dipartimento di Scienze Umane e Sociali, Univ. degli Studi di Bergamo, Bergamo, Italy.

Publication information

Bioinformatics. 2016 Jun 1;32(11):1610-7. doi: 10.1093/bioinformatics/btv495. Epub 2015 Aug 26.

PMID:26315913

Abstract

MOTIVATION

Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly benefits greatly from the advent of 'future-generation' sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods cannot handle such data in a fully satisfactory way, either because their accuracy or performance degrades as read length and sequencing coverage increase or because they rely on restrictive assumptions.

RESULTS

By exploiting a feature of future-generation technologies, namely the uniform distribution of sequencing errors, we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods on both real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly fewer computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, making it possible to phase datasets with higher coverage and without the traditional all-heterozygous assumption.
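
To make the notion of an error-correction score concrete, the following is a minimal Python sketch. It is not HapCol's algorithm: it is a brute-force illustration that assumes reads are encoded as strings over '0', '1' and '-' (position not covered), and it adopts the traditional all-heterozygous assumption for simplicity, which the abstract says HapCol does not need. The function names `mec_score` and `brute_force_phase` are hypothetical and chosen only for this example.

```python
from itertools import product


def mec_score(reads, h1, h2):
    """Count the allele corrections needed so that every read agrees
    with one of the two haplotypes (each read goes to the closer one)."""
    def mismatches(read, hap):
        # Positions marked '-' are not covered by the read and cost nothing.
        return sum(1 for r, h in zip(read, hap) if r != '-' and r != h)

    return sum(min(mismatches(r, h1), mismatches(r, h2)) for r in reads)


def brute_force_phase(reads):
    """Try every haplotype over n SNP positions (2**n candidates) and
    return (score, h1, h2) with the smallest correction score.
    Assumption for this sketch only: h2 is the complement of h1."""
    n = len(reads[0])
    best = None
    for bits in product('01', repeat=n):
        h1 = ''.join(bits)
        h2 = ''.join('1' if b == '0' else '0' for b in h1)
        score = mec_score(reads, h1, h2)
        if best is None or score < best[0]:
            best = (score, h1, h2)
    return best


# Toy input: five reads over four SNP positions; the last read carries one error.
reads = ["0011", "001-", "1100", "-100", "0111"]
score, h1, h2 = brute_force_phase(reads)
print(score, h1, h2)
```

This enumeration is exponential in the number of SNP positions and is practical only for toy inputs. By contrast, the abstract describes HapCol as exponential in the maximum number of corrections per SNP position.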

AVAILABILITY AND IMPLEMENTATION

Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/

CONTACT

bonizzoni@disco.unimib.it

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
