IBSAL/BISITE Research Group, University of Salamanca, Edificio I+D+i, 37007, Salamanca, Spain.
CISUC, ECOS Research Group, University of Coimbra, Pólo II-Pinhal de Marrocos, 3030-290, Coimbra, Portugal.
Interdiscip Sci. 2018 Mar;10(1):12-23. doi: 10.1007/s12539-017-0274-z. Epub 2018 Jan 8.
This paper proposes an ensemble framework for gene selection that addresses the instability problems arising in the gene filtering task. Selecting genes from gene expression data is a complex process, and different filter methods often return different subsets of informative genes, which makes it difficult for experts to identify the genes that are truly significant. Instability in the results can stem from four sources: the filter methods, the gene classification methods, different datasets of the same disease, and the existence of multiple valid groups of biomarkers. Although many approaches have been proposed, the problem remains a challenge. This work proposes a framework with five stages of gene filtering to discover biomarkers for diagnosis and classification tasks. The framework performs stable feature selection that addresses the problems above and thus provides a more suitable and reliable solution for clinical and research purposes. Our proposal is a multistage gene filtering process that incorporates several ensemble strategies for gene selection, so that different classifiers simultaneously assess gene subsets to counter instability. First, we apply an ensemble of recent gene selection methods to obtain diversity in the genes found (stability with respect to filter methods). Next, we apply an ensemble of well-known classifiers to retain the genes that are relevant to all classifiers at once (stability with respect to classification methods). The results were evaluated on two different datasets of the same disease (pancreatic ductal adenocarcinoma) to assess stability with respect to the disease, and they were promising.
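The two ensemble steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the filter methods, the classifiers, or the remaining stages of the five-stage framework, so everything below is an assumption made for illustration. The two scoring functions, the two single-gene leave-one-out classifiers, the parameters `k` and `min_acc`, the function `select_stable_genes`, and the toy expression values are all hypothetical stand-ins. The sketch takes the union of the top-ranked genes from each filter (diversity across filters) and then keeps only the genes that every classifier judges useful (agreement across classifiers).

```python
from statistics import mean, pstdev

def split_by_class(values, labels):
    """Separate one gene's expression values into class 1 and class 0."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return pos, neg

def mean_diff_score(values, labels):
    """Hypothetical filter 1: absolute difference of the class means."""
    pos, neg = split_by_class(values, labels)
    return abs(mean(pos) - mean(neg))

def signal_to_noise_score(values, labels):
    """Hypothetical filter 2: mean difference scaled by the class spreads."""
    pos, neg = split_by_class(values, labels)
    return abs(mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg) + 1e-9)

def top_k(expr, labels, score, k):
    """Return the k genes ranked highest by one filter."""
    ranked = sorted(expr, key=lambda g: score(expr[g], labels), reverse=True)
    return set(ranked[:k])

def centroid_loo_accuracy(values, labels):
    """Hypothetical classifier 1: leave-one-out nearest centroid on one gene."""
    samples = list(zip(values, labels))
    hits = 0
    for i, (v, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        c1 = mean(u for u, t in rest if t == 1)
        c0 = mean(u for u, t in rest if t == 0)
        hits += (1 if abs(v - c1) < abs(v - c0) else 0) == y
    return hits / len(samples)

def one_nn_loo_accuracy(values, labels):
    """Hypothetical classifier 2: leave-one-out 1-nearest-neighbour on one gene."""
    samples = list(zip(values, labels))
    hits = 0
    for i, (v, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        nearest = min(rest, key=lambda p: abs(p[0] - v))
        hits += nearest[1] == y
    return hits / len(samples)

def select_stable_genes(expr, labels, k=1, min_acc=0.8):
    """Union over the filter ensemble, then agreement over the classifier ensemble."""
    filters = [mean_diff_score, signal_to_noise_score]
    classifiers = [centroid_loo_accuracy, one_nn_loo_accuracy]
    # Step 1: each filter contributes its own top-k genes (diversity).
    candidates = set().union(*(top_k(expr, labels, f, k) for f in filters))
    # Step 2: keep a gene only if every classifier reaches min_acc with it.
    return sorted(g for g in candidates
                  if all(clf(expr[g], labels) >= min_acc for clf in classifiers))

# Toy data (invented for this sketch): 3 samples of class 1, then 3 of class 0.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "GENE_A": [9.0, 8.5, 9.2, 2.0, 2.5, 1.8],  # large separation
    "GENE_B": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],  # uninformative
    "GENE_C": [7.0, 7.2, 6.8, 3.0, 3.1, 2.9],  # smaller separation, low spread
    "GENE_D": [6.0, 1.0, 6.5, 5.5, 1.2, 6.2],  # classes overlap
}

print(select_stable_genes(expr, labels, k=1))  # ['GENE_A', 'GENE_C']
```

With `k=1` the two filters disagree on the single best gene (one prefers GENE_A, the other GENE_C), so the union keeps both. With `k=3`, GENE_D enters the candidate set but is discarded in the second step because the classifiers cannot separate the classes with it.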