Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Sci Rep. 2019 Mar 25;9(1):5059. doi: 10.1038/s41598-019-41451-3.
Efficient large-scale annotation of genomic intervals is essential for personal genome interpretation in the realm of precision medicine. There are 13 possible relations between two intervals according to Allen's interval algebra. Conventional interval trees are routinely used to identify the genomic intervals satisfying a coarse relation with a query interval, but cannot support efficient query for more refined relations such as all Allen's relations. We design and implement a novel approach to address this unmet need. Through rewriting Allen's interval relations, we transform an interval query to a range query, then adapt and utilize the range trees for querying. We implement two types of range trees: a basic 2-dimensional range tree (2D-RT) and an augmented range tree with fractional cascading (RTFC) and compare them with the conventional interval tree (IT). Theoretical analysis shows that RTFC can achieve the best time complexity for interval queries regarding all Allen's relations among the three trees. We also perform comparative experiments on the efficiency of RTFC, 2D-RT and IT in querying noncoding element annotations in a large collection of personal genomes. Our experimental results show that 2D-RT is more efficient than IT for interval queries regarding most of Allen's relations, RTFC is even more efficient than 2D-RT. The results demonstrate that RTFC is an efficient data structure for querying large-scale datasets regarding Allen's relations between genomic intervals, such as those required by interpreting genome-wide variation in large populations.
高效的大规模基因组区间标注对于精准医学领域的个人基因组解释至关重要。根据艾伦区间代数,两个区间之间有 13 种可能的关系。传统的区间树通常用于识别与查询区间具有粗粒度关系的基因组区间,但无法支持对更精细关系(如所有艾伦关系)的高效查询。我们设计并实现了一种新颖的方法来满足这一未满足的需求。通过重写艾伦区间关系,我们将区间查询转换为范围查询,然后适当地利用范围树进行查询。我们实现了两种类型的范围树:基本的二维范围树(2D-RT)和具有分数级联的增强范围树(RTFC),并将它们与传统的区间树(IT)进行比较。理论分析表明,在三种树中,RTFC 可以实现针对所有艾伦关系的区间查询的最佳时间复杂度。我们还在大规模个人基因组中大量非编码元件注释的查询效率方面对 RTFC、2D-RT 和 IT 进行了比较实验。实验结果表明,在针对大多数艾伦关系的区间查询方面,2D-RT 比 IT 更有效,而 RTFC 甚至比 2D-RT 更有效。结果表明,RTFC 是一种高效的数据结构,可用于查询大规模数据集的艾伦关系,例如在大规模人群中解释全基因组变异所需的关系。