Layer Ryan M, Quinlan Aaron R
Department of Human Genetics, University of Utah, Salt Lake City, UT, 84112.
Department of Human Genetics, University of Utah, Salt Lake City, UT, 84112. Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, 84112.
Proc IEEE Inst Electr Electron Eng. 2017 Mar;105(3):542-551. doi: 10.1109/JPROC.2015.2461494.
The comparison of sets of genome intervals (e.g., genes, repeats, ChIP-seq peaks) is essential to genome research, especially as modern sequencing technologies enable ever larger and more complex experiments. Relationships between genomic features are commonly identified by their intersection: that is, if feature sets contain overlapping intervals then it is inferred that they share a common biological function or origin. Using this technique, researchers identify genomic regions that are common among multiple (or unique to individual) datasets. While there have been recent advances in algorithms for pairwise intersections between two sets of genomic intervals, few advances have been made to the intersection of many sets of genomic intervals. Identifying intersections among many interval sets is particularly important when attempting to distill biological insights from the massive, multi-dimensional datasets that are common to modern genome research. For such analyses, speed and efficiency are crucial given the size and sheer number of datasets involved. To solve this problem, we present a novel "slice-then-sweep" algorithm that, given interval sets, efficiently reveals the subset of intervals that are common to all sets. We demonstrate that our algorithm is more efficient in the sequential case and has a vastly higher capacity for parallelization with a 19x speedup over the existing algorithm.
基因组区间集(例如,基因、重复序列、ChIP-seq峰)的比较对于基因组研究至关重要,特别是在现代测序技术使得实验规模越来越大且越来越复杂的情况下。基因组特征之间的关系通常通过它们的交集来确定:也就是说,如果特征集包含重叠区间,那么就推断它们具有共同的生物学功能或起源。使用这种技术,研究人员可以识别多个数据集共有的(或单个数据集特有的)基因组区域。虽然最近在两组基因组区间的成对交集算法方面取得了进展,但在多组基因组区间的交集方面进展甚微。当试图从现代基因组研究中常见的大规模、多维度数据集中提炼生物学见解时,识别多个区间集之间的交集尤为重要。对于此类分析,鉴于所涉及数据集的规模和数量,速度和效率至关重要。为了解决这个问题,我们提出了一种新颖的“切片然后扫描”算法,该算法在给定区间集的情况下,能够有效地揭示所有集合共有的区间子集。我们证明,我们的算法在顺序情况下更高效,并且具有更高的并行化能力,比现有算法快19倍。