IEEE/ACM Trans Comput Biol Bioinform. 2020 Jul-Aug;17(4):1125-1133. doi: 10.1109/TCBB.2018.2881975. Epub 2018 Nov 19.
A variant caller is used to identify variations in an individual genome (compared to the reference genome) in a genome processing pipeline. For the sake of accuracy, modern variant callers perform many local re-assemblies on small regions of the genome using a graph-based algorithm. However, such graph-based data structures are inefficiently stored in the linear memory of modern computers, which in turn reduces computing efficiency. Therefore, variant calling can take several CPU hours for a typical human genome. We have sped up the local re-assembly algorithm with no impact on its accuracy, by the effective use of the memory hierarchy. The proposed algorithm maximises data locality so that the fast internal processor memory (cache) is efficiently used. By the increased use of caches, accesses to main memory are minimised. The resulting algorithm is up to twice as fast as the original one when executed on a commodity computer and could gain even more speed up on computers with less complex memory subsystems.
变异调用程序用于在基因组处理管道中识别个体基因组(与参考基因组相比)中的变异。为了提高准确性,现代变异调用程序使用基于图的算法对基因组的小区域进行多次局部重新组装。然而,基于图的数据结构在现代计算机的线性内存中存储效率低下,这反过来又降低了计算效率。因此,典型的人类基因组的变异调用可能需要几个 CPU 小时。通过有效利用内存层次结构,我们在不影响准确性的情况下加快了局部重新组装算法。所提出的算法最大程度地提高了数据局部性,以便有效地利用快速的内部处理器内存(缓存)。通过增加缓存的使用,减少了对主内存的访问。在商品计算机上执行时,该算法的速度比原始算法快两倍,在具有更简单内存子系统的计算机上甚至可以获得更快的速度提升。