Gruca Aleksandra, Sikora Marek
Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland.
J Biomed Semantics. 2017 Jun 26;8(1):23. doi: 10.1186/s13326-017-0129-x.
High-throughput methods in molecular biology have provided researchers with an abundance of experimental data that must be interpreted in order to understand the experimental results. Manual interpretation of the function of gene or protein groups is expensive and time-consuming; therefore, there is a need for new, efficient data mining methods and bioinformatics tools that support the expert in the functional analysis of experimental results.
In this study, we propose a comprehensive framework for the induction of logical rules, in the form of combinations of Gene Ontology (GO) terms, for the functional interpretation of gene sets. Within the framework, we present four approaches: fully automated rule induction without filtering, rule induction with filtering, expert-driven rule filtering based on additive utility functions, and expert-driven rule induction based on so-called seed or expert terms, that is, GO terms of special interest that should be included in the description. These GO terms usually describe processes or pathways of particular interest that are related to the experiment being performed. During rule induction and filtering, such seed terms are used as the base on which the description is built.
We compare the descriptions obtained with different rule induction and filtering algorithms and show that a filtering step is required to reduce the number of rules in the output set so that they can be analyzed by a human expert. However, filtering may remove information from the output rule set that is potentially interesting to the expert. Therefore, we present two methods that involve interaction with the expert during the process of rule induction. Both are able to reduce the number of rules, but only with the method based on seed terms does each created rule include expert terms in combination with other terms. Further analysis of such combinations may provide new knowledge about biological processes and their combination with other pathways related to the genes described by the rules. A suite of Matlab scripts that provides the functionality of the comprehensive rule induction and filtering framework presented in this study is available free of charge at: http://rulego.polsl.pl/framework .
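As a rough illustration of the expert-driven filtering idea described above, the following Python sketch ranks rules by a weighted sum of quality criteria and optionally keeps only rules that contain all seed terms. It is not the authors' Matlab implementation: the `Rule` fields, the choice of precision and coverage as criteria, the weights, and the toy rule values are all assumptions made for this example. The seed-term check is also a simplification, since the framework uses seed terms during induction itself rather than as a post-hoc filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rule premise as a conjunction of GO terms (hypothetical structure)."""
    terms: frozenset   # GO term identifiers in the rule premise
    precision: float   # share of covered genes that belong to the described group
    coverage: float    # share of the described group's genes covered by the rule


def additive_utility(rule, weights):
    """Weighted sum of rule quality criteria (assumed form of an additive utility)."""
    return (weights["precision"] * rule.precision
            + weights["coverage"] * rule.coverage)


def filter_rules(rules, weights, top_k, seed_terms=frozenset()):
    """Keep the top_k rules by utility among those containing every seed term."""
    candidates = [r for r in rules if seed_terms <= r.terms]
    ranked = sorted(candidates,
                    key=lambda r: additive_utility(r, weights),
                    reverse=True)
    return ranked[:top_k]


# Toy data: GO identifiers and quality values are illustrative, not from the study.
rules = [
    Rule(frozenset({"GO:0006915", "GO:0006955"}), precision=0.9, coverage=0.2),
    Rule(frozenset({"GO:0007049"}), precision=0.6, coverage=0.7),
    Rule(frozenset({"GO:0006915", "GO:0007049"}), precision=0.8, coverage=0.4),
]
weights = {"precision": 0.5, "coverage": 0.5}  # expert-chosen weights (assumed)

top_overall = filter_rules(rules, weights, top_k=2)
top_with_seed = filter_rules(rules, weights, top_k=2,
                             seed_terms=frozenset({"GO:0006915"}))
```

In this sketch, changing the weights shifts the ranking toward more precise or more general rules, which mirrors how an expert might steer the filtering; passing `seed_terms` guarantees that every returned rule mentions the terms of interest.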