Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy.
Department of Computer Science, Princeton University, Princeton, New Jersey, USA.
BMC Bioinformatics. 2018 Jul 3;19(1):252. doi: 10.1186/s12859-018-2253-8.
Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are now cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes, thanks to their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue that can be mitigated with larger sets of reads, provided that the error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads can handle only limited coverage.
Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverage. In particular, our algorithm dynamically adapts the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This significantly reduces the required computational resources, making it feasible to process datasets with higher coverage. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60× coverage shows that considering higher coverage improves accuracy and recall, at lower runtimes.
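The objective of minimizing the total number of error corrections can be illustrated with a minimal sketch. The code below is not HapCHAT's algorithm: it is an exhaustive search over read bipartitions on a toy input, assuming unweighted corrections and biallelic sites encoded as 0/1. The names `mec_cost` and `best_bipartition`, and the toy reads, are hypothetical and chosen only for illustration.

```python
from itertools import product


def mec_cost(reads, assignment):
    """Count the allele flips needed so that, within each of the two
    groups, all reads covering a site agree on its allele."""
    cost = 0
    sites = {site for read in reads for site in read}
    for site in sites:
        for group in (0, 1):
            alleles = [
                read[site]
                for read, g in zip(reads, assignment)
                if g == group and site in read
            ]
            ones = sum(alleles)
            # Flip the minority allele to match the majority.
            cost += min(ones, len(alleles) - ones)
    return cost


def best_bipartition(reads):
    """Try every split of the reads into two groups (one per haplotype)
    and return the cheapest one. Exponential in the number of reads,
    so only usable on toy inputs."""
    best = None
    # Fix the first read in group 0 to skip mirrored assignments.
    for rest in product((0, 1), repeat=len(reads) - 1):
        assignment = (0,) + rest
        cost = mec_cost(reads, assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Each read maps a variant site index to the observed allele (0 or 1).
reads = [
    {0: 0, 1: 0, 2: 1},
    {0: 0, 1: 0},
    {1: 1, 2: 0},
    {0: 1, 1: 1, 2: 0},
    {0: 0, 1: 1, 2: 1},  # disagrees with the first read at site 1
]
cost, assignment = best_bipartition(reads)
print(cost, assignment)  # 1 (0, 0, 1, 1, 0)
```

On this input a single correction (at site 1 of the last read) makes the reads consistent with two haplotypes. The exhaustive search grows exponentially with the number of reads, which is why practical tools rely on dynamic programming and why coverage becomes the limiting factor.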
Our method leverages the long-range information of sequencing reads, which yields assembled haplotypes fragmented into fewer unphased haplotype blocks. At the same time, our method can handle higher coverage, allowing it to better correct the errors in the original reads and, as a result, to obtain more accurate haplotypes.
HapCHAT is available at http://hapchat.algolab.eu under the GNU General Public License (GPL).