Department of Computer Science, Indiana University, Bloomington, USA.
Bioinformatics. 2018 Jul 1;34(13):i313-i322. doi: 10.1093/bioinformatics/bty268.
Modern problems of concept annotation associate an object of interest (gene, individual, text document) with a set of interrelated textual descriptors (functions, diseases, topics), often organized in concept hierarchies or ontologies. Most ontology can be seen as directed acyclic graphs (DAGs), where nodes represent concepts and edges represent relational ties between these concepts. Given an ontology graph, each object can only be annotated by a consistent sub-graph; that is, a sub-graph such that if an object is annotated by a particular concept, it must also be annotated by all other concepts that generalize it. Ontologies therefore provide a compact representation of a large space of possible consistent sub-graphs; however, until now we have not been aware of a practical algorithm that can enumerate such annotation spaces for a given ontology.
We propose an algorithm for enumerating consistent sub-graphs of DAGs. The algorithm recursively partitions the graph into strictly smaller graphs until the resulting graph becomes a rooted tree (forest), for which a linear-time solution is computed. It then combines the tallies from graphs created in the recursion to obtain the final count. We prove the correctness of this algorithm, propose several practical accelerations, evaluate it on random graphs and then apply it to characterize four major biomedical ontologies. We believe this work provides valuable insights into the complexity of concept annotation spaces and its potential influence on the predictability of ontological annotation.
https://github.com/shawn-peng/counting-consistent-sub-DAG.
Supplementary data are available at Bioinformatics online.
现代的概念注释问题将感兴趣的对象(基因、个体、文本文档)与一组相关的文本描述符(功能、疾病、主题)联系起来,这些描述符通常组织在概念层次结构或本体中。大多数本体可以看作是有向无环图(DAG),其中节点表示概念,边表示这些概念之间的关系纽带。给定一个本体图,每个对象只能通过一个一致的子图进行注释;也就是说,一个子图,使得如果一个对象被一个特定的概念注释,它也必须被所有其他概括它的概念注释。本体因此提供了一个紧凑的表示,用于表示一个可能的一致子图的大空间;然而,直到现在,我们还没有意识到一个实用的算法可以为给定的本体枚举这样的注释空间。
我们提出了一种用于枚举 DAG 一致子图的算法。该算法递归地将图划分为严格较小的图,直到得到的图成为一个有根树(森林),对于这种情况,计算出一个线性时间的解决方案。然后,它将递归中创建的图形的计数合并起来,以获得最终的计数。我们证明了该算法的正确性,提出了几种实用的加速方法,在随机图上进行了评估,然后将其应用于描述四个主要的生物医学本体。我们相信这项工作为概念注释空间的复杂性及其对本体注释的可预测性的潜在影响提供了有价值的见解。
https://github.com/shawn-peng/counting-consistent-sub-DAG。
补充数据可在 Bioinformatics 在线获取。