Anděl Michael, Kléma Jiří, Krejčík Zdeněk
Department of Computer Science, Czech Technical University, Technická 2, Prague, Czech Republic.
Department of Molecular Genetics, Institute of Hematology and Blood Transfusion, U Nemocnice 1, Prague, Czech Republic.
Methods. 2015 Jul 15;83:88-97. doi: 10.1016/j.ymeth.2015.04.006. Epub 2015 Apr 11.
Contemporary molecular biology works with large, heterogeneous sets of measurements to model and understand underlying biological processes, including complex diseases. Machine learning is a common approach to building such models. However, models built solely from measured data often overfit, because the sample size is typically much smaller than the number of measured features. In this paper, we propose a random forest-based classifier that reduces this overfitting with the aid of prior knowledge in the form of a feature interaction network. We illustrate the proposed method on the task of disease classification from measured mRNA and miRNA profiles, complemented by an interaction network composed of miRNA-mRNA target relations and of mRNA-mRNA interactions that correspond to interactions between the encoded proteins. We demonstrate that the proposed network-constrained forest employs prior knowledge to increase learning bias and consequently to improve the classification accuracy, stability and comprehensibility of the resulting model. The experiments are carried out in the domain of myelodysplastic syndrome, which is our long-term research focus. We validate our approach on publicly available ovarian carcinoma data of the same form. We believe that the idea of a network-constrained forest can be straightforwardly generalized to arbitrary omics data for which a non-trivial feature interaction network is available. The proposed method is publicly available as part of the miXGENE system (http://mixgene.felk.cvut.cz); the workflow that implements the myelodysplastic syndrome experiments is presented there as a dedicated case study.
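The abstract does not spell out how the interaction network constrains the forest, so the following is only a minimal sketch under one assumed mechanism: the root of each tree may split on any feature, and every later split may use only features already used in that tree or their direct network neighbours. All function names, the `mtry` parameter, and the toy features (`mirX`, `geneA`, `geneB`, `geneC`) are hypothetical and are not taken from the paper or from miXGENE.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, rows, candidates):
    """Best (feature, threshold) among candidates, or None if no split lowers impurity."""
    best, best_score = None, gini([y[r] for r in rows])
    for f in candidates:
        values = sorted({X[r][f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[r] for r in rows if X[r][f] <= t]
            right = [y[r] for r in rows if X[r][f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow(X, y, rows, network, used, rng, mtry, depth):
    """Grow one tree; a leaf is a label, an inner node is (feature, threshold, left, right)."""
    labels = [y[r] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    # Assumed constraint: after the root, candidates come only from features
    # already used in this tree and their direct neighbours in the network.
    if used:
        pool = sorted(used | {n for f in used for n in network[f]})
    else:
        pool = sorted(network)
    split = best_split(X, y, rows, rng.sample(pool, min(mtry, len(pool))))
    if split is None:
        return majority
    feature, threshold = split
    used.add(feature)
    left = [r for r in rows if X[r][feature] <= threshold]
    right = [r for r in rows if X[r][feature] > threshold]
    return (feature, threshold,
            grow(X, y, left, network, used, rng, mtry, depth - 1),
            grow(X, y, right, network, used, rng, mtry, depth - 1))

def fit_forest(X, y, network, n_trees=25, mtry=2, max_depth=3, seed=0):
    """Train n_trees network-constrained trees, each on a bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in X]
        forest.append(grow(X, y, rows, network, set(), rng, mtry, max_depth))
    return forest

def predict(forest, x):
    """Majority vote over all trees for one sample x (dict: feature -> value)."""
    votes = []
    for node in forest:
        while isinstance(node, tuple):
            feature, threshold, left, right = node
            node = left if x[feature] <= threshold else right
        votes.append(node)
    return Counter(votes).most_common(1)[0][0]

def tree_features(node):
    """Set of features a tree splits on."""
    if not isinstance(node, tuple):
        return set()
    feature, _, left, right = node
    return {feature} | tree_features(left) | tree_features(right)

# Toy network: one miRNA targets geneA, geneA interacts with geneB, geneC is isolated.
network = {"mirX": {"geneA"}, "geneA": {"mirX", "geneB"},
           "geneB": {"geneA"}, "geneC": set()}
control = {"mirX": 0.9, "geneA": 0.1, "geneB": 0.2, "geneC": 0.5}
disease = {"mirX": 0.1, "geneA": 0.9, "geneB": 0.8, "geneC": 0.5}
X = [dict(control) for _ in range(6)] + [dict(disease) for _ in range(6)]
y = ["control"] * 6 + ["disease"] * 6

forest = fit_forest(X, y, network)
print(predict(forest, disease))
```

Under this assumption, the features used by any single tree always form a connected part of the network, which is one way a tree could stay interpretable as a small interaction module. A real implementation would more likely build on an existing random forest library and a curated miRNA-mRNA and protein interaction network.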