School of Computing, University of Kent, Canterbury, Kent, UK.
Integrative Genomics of Ageing Group, Institute of Ageing and Chronic Disease, University of Liverpool, Liverpool, UK.
Bioinformatics. 2018 Jul 15;34(14):2449-2456. doi: 10.1093/bioinformatics/bty087.
This work uses the Random Forest (RF) classification algorithm to predict whether a gene is over-expressed, under-expressed or unchanged in expression with age in the brain. RFs have high predictive power, and RF models can be interpreted using a feature (variable) importance measure. However, current feature importance measures evaluate a feature as a whole, across all of its values. We show that, for a popular type of biological data (Gene Ontology-based), usually only one value of a feature is particularly important for classification and for interpreting the RF model. Hence, we propose a new algorithm for identifying the most important and most informative feature values in an RF model.
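The difference between whole-feature importance and feature-value importance can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper: it only splits a toy dataset on one binary feature (a hypothetical Gene Ontology annotation, 1 = gene annotated with the term) and attributes the Gini impurity decrease to each branch separately. The function names, the toy genes and their labels are all assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def per_value_gain(feature, labels):
    """Attribute the impurity decrease of a binary split to each feature value.

    Returns {value: weighted impurity decrease of that branch}. The two
    entries sum to the usual whole-feature Gini gain of the split.
    """
    n = len(labels)
    parent = gini(labels)
    gains = {}
    for value in (0, 1):
        branch = [y for x, y in zip(feature, labels) if x == value]
        gains[value] = (len(branch) / n) * (parent - gini(branch))
    return gains


# Hypothetical toy data: 10 genes, one binary GO-term feature, three classes.
go_term = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
expression = ["over", "over", "over", "over", "under",
              "none", "under", "none", "over", "under"]

gains = per_value_gain(go_term, expression)
whole_feature_gain = gains[0] + gains[1]
print(gains, whole_feature_gain)
```

In this toy case the branch for value 1 (annotated genes) is pure and carries all of the useful signal, while the branch for value 0 is no purer than the full gene set. A single whole-feature score would hide that asymmetry, which is the motivation the abstract gives for ranking feature values rather than features.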
The new feature importance measure identified highly relevant Gene Ontology terms for the aforementioned gene classification task, producing a feature ranking that is much more informative to biologists than an alternative, state-of-the-art feature importance measure.
The dataset and source code used in this paper are available as 'Supplementary Material', and a description of the data can be found at: https://fabiofabris.github.io/bioinfo2018/web/.
Supplementary data are available at Bioinformatics online.