Malinka František, Železný Filip, Kléma Jiří
Department of Computer Science, Czech Technical University in Prague, Karlovo náměstí 13, Prague, 121 35 Czech Republic.
Czech Centre for Phenogenomics, Institute of Molecular Genetics of the Czech Academy of Sciences, Prague, Czech Republic.
BioData Min. 2020 Sep 1;13:13. doi: 10.1186/s13040-020-00219-6. eCollection 2020.
Identification of non-trivial and meaningful patterns in omics data is one of the most important biological tasks. The patterns help to better understand biological systems and interpret experimental outcomes. A well-established method for explaining such biological data is Gene Set Enrichment Analysis. However, this type of analysis is restricted to a specific type of evaluation. In simplified terms, the analyst provides a sorted list of genes together with ontological annotations of the individual genes, and the method outputs the subset of ontological terms enriched in the gene list. Here, in contrast to enrichment analysis, we introduce a new tool/framework that allows for the induction of more complex patterns in 2-dimensional binary omics data. This extension makes it possible to discover and describe semantically coherent biclusters.
We present a new, fast method called sem1R that reveals interpretable hidden rules in omics data. These rules capture semantic differences between two classes: a target class, given as a collection of positive examples, and a non-target class containing negative examples. The method is inspired by the CN2 rule learner and introduces a new refinement operator that exploits prior knowledge in the form of ontologies. In our work, this knowledge serves to create accurate and interpretable rules. The novel refinement operator uses two reduction procedures, Redundant Generalization and Redundant Non-potential. Both dramatically prune the rule space and consequently speed up the entire process of rule induction in comparison with the traditional refinement operator used in CN2.
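The general idea of ontology-guided rule refinement can be illustrated with a small sketch. This is not the sem1R implementation and does not reproduce the authors' definitions of Redundant Generalization or Redundant Non-potential, which are not given here. It is a toy beam search over conjunctions of ontology terms, reduced to one dimension (genes only), with two simplified stand-in pruning checks: a candidate term is skipped if it is already implied by the rule, and a refined rule is skipped if it covers no positive example. The ontology, gene names, annotations, scoring function, and all function names are assumptions made for illustration.

```python
from itertools import chain

# Toy ontology (hypothetical): term -> set of direct parent terms.
PARENTS = {
    "process": set(),
    "metabolism": {"process"},
    "lipid_metabolism": {"metabolism"},
    "signaling": {"process"},
}

def ancestors(term):
    """Return all proper ancestors of a term in the toy ontology."""
    seen, stack = set(), list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen

def close(terms):
    """Return the terms plus every ancestor they imply."""
    return set(terms) | set(chain.from_iterable(ancestors(t) for t in terms))

# Hypothetical gene annotations, closed under the ontology.
ANNOT = {
    "g1": close({"lipid_metabolism"}),
    "g2": close({"lipid_metabolism", "signaling"}),
    "g3": close({"metabolism"}),
    "g4": close({"signaling"}),
}
POS, NEG = {"g1", "g2"}, {"g3", "g4"}  # target / non-target examples

def covers(rule, gene):
    """A rule (conjunction of terms) covers a gene annotated with all of them."""
    return rule <= ANNOT[gene]

def score(rule):
    """Plain accuracy of the rule on the toy data (an assumed scoring choice)."""
    tp = sum(covers(rule, g) for g in POS)
    tn = sum(not covers(rule, g) for g in NEG)
    return (tp + tn) / (len(POS) + len(NEG))

def refine(rule):
    """Yield specializations of a rule, skipping two kinds of candidates."""
    implied = close(rule)
    for term in PARENTS:
        if term in implied:
            continue  # stand-in check 1: term is already implied by the rule
        # Adding a more specific term makes its ancestors in the rule redundant.
        new_rule = frozenset(t for t in rule if t not in ancestors(term)) | {term}
        if not any(covers(new_rule, g) for g in POS):
            continue  # stand-in check 2: no positive example is covered
        yield new_rule

def learn(beam_width=2, max_steps=3):
    """Beam search from the empty rule, keeping the best-scoring rule seen."""
    beam, best = [frozenset()], frozenset()
    for _ in range(max_steps):
        candidates = {r for rule in beam for r in refine(rule)}
        if not candidates:
            break
        beam = sorted(candidates, key=lambda r: (-score(r), sorted(r)))[:beam_width]
        if score(beam[0]) > score(best):
            best = beam[0]
    return best

if __name__ == "__main__":
    print(sorted(learn()))
```

On this toy data, the search settles on the single term that separates the two classes. The point of the sketch is only that ontology structure lets a learner discard candidates before scoring them, which is the kind of rule-space reduction the paragraph above describes.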
The efficiency and effectiveness of the novel refinement operator were tested on three different real gene expression datasets: the Dresden Ovary Dataset, DISC, and m2816. The experiments show that the ontology-based refinement operator speeds up pattern induction drastically. The algorithm is written in C++ and is published as an R package available at http://github.com/fmalinka/sem1r.