Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden.
Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden.
BMC Bioinformatics. 2021 Mar 6;22(1):110. doi: 10.1186/s12859-021-04049-z.
Machine learning involves strategies and algorithms that may assist bioinformatics analyses in terms of data mining and knowledge discovery. In several applications, e.g., in the life sciences, it is often more important to understand how a prediction was obtained than to know what prediction was made. To this end, so-called interpretable machine learning has recently been advocated. In this study, we implemented an interpretable machine learning package based on rough set theory. An important aim of our work was to provide statistical properties of the models and their components.
We present the R.ROSETTA package, which is an R wrapper of the ROSETTA framework. The original ROSETTA functions have been improved and adapted to the R programming environment. The package allows for building and analyzing non-linear interpretable machine learning models. R.ROSETTA gathers combinatorial statistics via rule-based modelling to give accessible and transparent results that are well suited for adoption within the wider scientific community. The package also provides statistics and visualization tools that help minimize analysis bias and noise. The R.ROSETTA package is freely available at https://github.com/komorowskilab/R.ROSETTA. To illustrate the usage of the package, we applied it to a transcriptome dataset from an autism case-control study. Our tool provided hypotheses for potential co-predictive mechanisms among features that discerned phenotype classes. These co-predictors represented neurodevelopmental and autism-related genes.
R.ROSETTA provides new insights for interpretable machine learning analyses and knowledge-based systems. We demonstrated that our package facilitated detection of dependencies for autism-related genes. Although the sample application of R.ROSETTA illustrates transcriptome data analysis, the package can be used to analyze any data organized in decision tables.
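To make the notion of a decision table and a rule-based model concrete, the following minimal Python sketch may help. It is not the R.ROSETTA API and does not reproduce the package's algorithms. The table contents, the gene names (`geneA`, `geneB`), the `Rule` class, and the `rule_statistics` helper are all hypothetical and invented for illustration. The sketch assumes the rule statistics commonly used in rough-set rule models: support (objects matching the IF part), accuracy (fraction of those objects that also match the THEN part), and coverage (fraction of the decision class that the rule captures).

```python
from dataclasses import dataclass

# Hypothetical decision table: each row is one object (sample). The
# condition attributes are discretized expression levels and "class" is
# the decision attribute. All names and values are invented.
DECISION_TABLE = [
    {"geneA": "high", "geneB": "low", "class": "autism"},
    {"geneA": "high", "geneB": "low", "class": "autism"},
    {"geneA": "high", "geneB": "low", "class": "control"},
    {"geneA": "low", "geneB": "low", "class": "control"},
    {"geneA": "low", "geneB": "high", "class": "control"},
    {"geneA": "high", "geneB": "high", "class": "autism"},
]


@dataclass
class Rule:
    """IF all conditions hold THEN decision (hypothetical helper type)."""
    conditions: dict
    decision: str


def rule_statistics(rule, table, decision_attr="class"):
    """Compute support, accuracy, and coverage of one rule on a table."""
    # Objects whose condition attributes match the IF part of the rule.
    matching = [
        row for row in table
        if all(row[attr] == value for attr, value in rule.conditions.items())
    ]
    lhs_support = len(matching)
    # Matching objects that also carry the decision predicted by the rule.
    rhs_support = sum(1 for row in matching if row[decision_attr] == rule.decision)
    # Size of the decision class in the whole table.
    class_size = sum(1 for row in table if row[decision_attr] == rule.decision)
    return {
        "lhs_support": lhs_support,
        "rhs_support": rhs_support,
        "accuracy": rhs_support / lhs_support if lhs_support else 0.0,
        "coverage": rhs_support / class_size if class_size else 0.0,
    }


if __name__ == "__main__":
    # Two co-occurring conditions stand in for a "co-predictive" pair.
    rule = Rule({"geneA": "high", "geneB": "low"}, "autism")
    print(rule_statistics(rule, DECISION_TABLE))
```

A rule with two or more conditions, as in the example, is the kind of structure from which hypotheses about co-predictive features could be read. Any tabular dataset with condition attributes and one decision attribute fits the same representation.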