A parallel algorithm for N-way interval set intersection.

Author information

Layer Ryan M, Quinlan Aaron R

Affiliations

Department of Human Genetics, University of Utah, Salt Lake City, UT, 84112.

Department of Human Genetics, University of Utah, Salt Lake City, UT, 84112. Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, 84112.

Publication information

Proc IEEE Inst Electr Electron Eng. 2017 Mar;105(3):542-551. doi: 10.1109/JPROC.2015.2461494.

DOI:10.1109/JPROC.2015.2461494

PMID:30333632

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6188649/

Abstract

The comparison of sets of genome intervals (e.g., genes, repeats, ChIP-seq peaks) is essential to genome research, especially as modern sequencing technologies enable ever larger and more complex experiments. Relationships between genomic features are commonly identified by their intersection: that is, if feature sets contain overlapping intervals then it is inferred that they share a common biological function or origin. Using this technique, researchers identify genomic regions that are common among multiple (or unique to individual) datasets. While there have been recent advances in algorithms for pairwise intersections between two sets of genomic intervals, few advances have been made in the intersection of many sets of genomic intervals. Identifying intersections among many interval sets is particularly important when attempting to distill biological insights from the massive, multi-dimensional datasets that are common to modern genome research. For such analyses, speed and efficiency are crucial given the size and sheer number of datasets involved. To solve this problem, we present a novel "slice-then-sweep" algorithm that, given N interval sets, efficiently reveals the subset of intervals that are common to all N sets. We demonstrate that our algorithm is more efficient in the sequential case and has a vastly higher capacity for parallelization with a 19x speedup over the existing algorithm.
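To make the problem statement concrete, the following is a minimal sequential sketch of N-way intersection using a plain sweep over sorted interval endpoints. It is not the authors' "slice-then-sweep" algorithm, whose details the abstract does not give. It only illustrates what "common to all N sets" means. The function name `n_way_intersection`, the half-open `[start, end)` coordinate convention, and the choice to report common regions rather than the original intervals are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); an assumed convention


def n_way_intersection(sets: List[List[Interval]]) -> List[Interval]:
    """Return the regions covered by at least one interval from every set."""
    n = len(sets)
    if n == 0:
        return []

    # Build sweep events as (position, delta, set index).
    # At equal positions, ends (delta = -1) sort before starts (delta = +1),
    # so half-open intervals that merely touch are not treated as overlapping.
    events = []
    for idx, intervals in enumerate(sets):
        for start, end in intervals:
            if start < end:  # skip empty or inverted intervals
                events.append((start, 1, idx))
                events.append((end, -1, idx))
    events.sort()

    depth = [0] * n      # number of currently open intervals per set
    active = 0           # number of sets with at least one open interval
    region_start = None  # start of the current all-sets region, if any
    result = []

    for pos, delta, idx in events:
        if delta == 1:
            depth[idx] += 1
            if depth[idx] == 1:
                active += 1
                if active == n:
                    region_start = pos  # every set now covers this position
        else:
            # Closing the last open interval of a set ends a common region.
            if depth[idx] == 1 and active == n:
                result.append((region_start, pos))
            depth[idx] -= 1
            if depth[idx] == 0:
                active -= 1
    return result


if __name__ == "__main__":
    example = [
        [(1, 10), (20, 30)],
        [(5, 25)],
        [(0, 7), (8, 22)],
    ]
    print(n_way_intersection(example))
```

In this sketch the cost is dominated by sorting all endpoints, and the sweep itself is inherently sequential. That limitation is the kind of bottleneck the abstract says the slice-then-sweep approach is designed to overcome.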





