Gardeux Vincent, Chelouah Rachid, Wanderley Maria F Barbosa, Siarry Patrick, Braga Antônio P, Reyal Fabien, Rouzier Roman, Pusztai Lajos, Natowicz René
EISTI engineering school, Department of Computer Science, Cergy, France. ; LISSI laboratory, University of Paris-Est, Créteil, France.
EISTI engineering school, Department of Computer Science, Cergy, France.
Cancer Inform. 2015 Apr 19;14:33-45. doi: 10.4137/CIN.S21111. eCollection 2015.
Filter feature selection methods compute molecular signatures by selecting subsets of genes from the ranking induced by a valuation function. The motivation for the choice of valuation function is almost always clearly stated, but the motivation for selecting genes according to their rank is hardly ever made explicit.
We addressed the computation of molecular signatures by searching for the optima of a bi-objective function whose solution space was the set of all possible molecular signatures, i.e., the set of subsets of genes. The two objectives were the size of the signature (to be minimized) and the interclass distance induced by the signature (to be maximized).
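To make the formulation concrete, the following is a sketch of how a convex combination of the two objectives can be written. It assumes, beyond what the abstract states, that the interclass distance is additive over genes, with gene g contributing a nonnegative amount \delta_g; the weight \lambda and the symbols F_\lambda, D, and \delta_g are notation introduced here for illustration.

```latex
\begin{align*}
F_\lambda(S) &= (1-\lambda)\,D(S) \;-\; \lambda\,|S|, \qquad \lambda \in (0,1), \\
D(S) &= \sum_{g \in S} \delta_g \qquad \text{(assumed additive form)}, \\
F_\lambda(S) &= \sum_{g \in S} \bigl[(1-\lambda)\,\delta_g - \lambda\bigr].
\end{align*}
```

Under this assumption, a gene increases F_\lambda exactly when \delta_g > \lambda/(1-\lambda), so the maximizer is the set of genes whose contribution exceeds that threshold. As \lambda decreases the threshold falls and genes enter in decreasing order of \delta_g, which yields nested top-k sets, one for each size k = 1, ..., n when the \delta_g are distinct.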
We showed that: 1) the convex combination of the two objectives had exactly n optimal nonempty signatures, where n was the number of genes; 2) the n optimal signatures were nested; and 3) the optimal signature of size k was the subset of the k top-ranked genes, i.e., those that contributed the most to the interclass distance. We applied our feature selection method to five public datasets in oncology and assessed the prediction performance of the optimal signatures as input to the diagonal linear discriminant analysis (DLDA) classifier. Performance was at the same level as, or better than, the best previously reported results. The predictions were robust, and the signatures were almost always significantly smaller. We studied in more detail the performance of our predictive modeling on two breast cancer datasets for predicting the response to preoperative chemotherapy: performance was higher than previously reported, the signatures were about three times smaller (11-gene versus 30-gene signatures), and the member genes of the signature were known to be involved in the response to chemotherapy.
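The pipeline described above (rank genes by their contribution to the interclass distance, keep the k top-ranked genes, classify with DLDA) can be sketched as follows. This is a minimal illustration, not the authors' DeltaRanking software: the per-gene contribution used here (squared difference of class means over pooled variance), the equal class priors, and all function names are assumptions chosen for the example.

```python
import numpy as np

def gene_contributions(X, y):
    """Per-gene contribution to an additive interclass distance.

    Assumed form: squared difference of the two class means divided by
    the pooled within-class variance. X is (samples, genes), y is 0/1.
    """
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    pooled_var = ((n0 - 1) * X0.var(axis=0, ddof=1)
                  + (n1 - 1) * X1.var(axis=0, ddof=1)) / (n0 + n1 - 2)
    return (X1.mean(axis=0) - X0.mean(axis=0)) ** 2 / pooled_var

def optimal_signature(scores, k):
    """Indices of the k top-ranked genes; signatures are nested in k."""
    order = np.argsort(scores)[::-1]  # decreasing contribution
    return order[:k]

def dlda_fit(X, y, genes):
    """Fit DLDA on the selected genes: class means + pooled diagonal variance."""
    Xs = X[:, genes]
    classes = np.unique(y)
    means = np.array([Xs[y == c].mean(axis=0) for c in classes])
    resid = Xs - means[np.searchsorted(classes, y)]
    var = (resid ** 2).sum(axis=0) / (len(y) - len(classes))
    return classes, means, var

def dlda_predict(model, X, genes):
    """Assign each sample to the class with the smallest variance-scaled
    squared distance to the class mean (equal priors assumed)."""
    classes, means, var = model
    Xs = X[:, genes]
    d = (((Xs[:, None, :] - means[None, :, :]) ** 2) / var).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

# Synthetic demo: 50 genes, of which only the first 3 separate the classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 50))
X[y == 1, :3] += 3.0

scores = gene_contributions(X, y)
signature = optimal_signature(scores, 3)
model = dlda_fit(X, y, signature)
predictions = dlda_predict(model, X, signature)
```

Because a single sort of the gene scores yields every signature size at once, the selection step costs O(n log n) in the number of genes; in practice k would be chosen by cross-validation rather than fixed in advance as it is in this demo.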
Defining molecular signatures as the optima of a bi-objective function that combined signature size and interclass distance was well founded and efficient for prediction in oncogenomics. The computational complexity was very low because the optimal signatures were simply the sets of top-ranked genes under the valuation function. The software can be freely downloaded from http://gardeux-vincent.eu/DeltaRanking.php.