Benshahar Arnon, Chalifa-Caspi Vered, Hermelin Danny, Ziv-Ukelson Michal
1 Department of Computer Science, Ben-Gurion University , Beer-Sheva, Israel .
2 National Institute for Biotechnology in the Negev, Ben-Gurion University , Beer-Sheva, Israel .
J Comput Biol. 2018 Feb;25(2):214-235. doi: 10.1089/cmb.2017.0108. Epub 2017 Oct 13.
We formalize a new problem variant in gene-block discovery, denoted Reference-Anchored Gene Blocks (RAGB), given a query sequence Q of length n, representing the gene array of a DNA element, a window size bound d on the length of a substring of interest in Q, and a set of target gene sequences [Formula: see text]. Our objective is to identify gene blocks in [Formula: see text] that are centered in a subset q of co-localized genes from Q, and contain genomes from [Formula: see text] in which the corresponding orthologs of the genes from q are also co-localized. We cast RAGB as a variant of a (colored) biclique problem in bipartite graphs, and analyze its parameterized complexity, as well as the parameterized complexity of other related problems. We give an [Formula: see text] time algorithm for the uncolored variant of our biclique problem, where m is the number of areas of interest that are parsed from the target sequences, and n and d are as defined earlier. Our algorithm can be adapted to compute all maximal bicliques in the graph within the same time complexity, and to handle edge weights with a slight [Formula: see text] increase to its time complexity. For the colored version of the problem, our algorithm has a time complexity of [Formula: see text]. We implement the algorithm and exemplify its application to the data mining of proteobacterial gene blocks that are centered in predicted proteobacterial genomic islands, leading to the identification of putatively mobilized clusters of virulence, pathogenicity, and resistance genes.
在基因块发现中,我们形式化了一个新的问题变体,称为参考锚定基因块(RAGB)。给定一个长度为n的查询序列Q,它代表一个DNA元件的基因阵列,一个关于Q中感兴趣子串长度的窗口大小界限d,以及一组目标基因序列[公式:见原文]。我们的目标是在[公式:见原文]中识别基因块,这些基因块以Q中一组共定位基因的子集q为中心,并且包含来自[公式:见原文]的基因组,其中q中基因的相应直系同源基因也在这些基因组中共定位。我们将RAGB转化为二分图中(带颜色的)双团问题的一个变体,并分析其参数化复杂度以及其他相关问题的参数化复杂度。对于我们双团问题的无色变体,我们给出了一个[公式:见原文]时间复杂度的算法,其中m是从目标序列中解析出的感兴趣区域的数量,n和d如前文所定义。我们的算法可以在相同的时间复杂度内进行调整以计算图中的所有最大双团,并在时间复杂度略有[公式:见原文]增加的情况下处理边权重。对于该问题的带颜色版本,我们的算法时间复杂度为[公式:见原文]。我们实现了该算法,并举例说明了其在以预测的变形菌属基因组岛为中心的变形菌属基因块数据挖掘中的应用,从而识别出可能被转移的毒力、致病性和抗性基因簇。