Graduate School of Media and Governance, Keio University, Fujisawa, Kanagawa, 252-8520, Japan.
Research Fellow of Japan Society for the Promotion of Science, Chiyoda, Tokyo, 102-0083, Japan.
Sci Rep. 2020 May 13;10(1):7923. doi: 10.1038/s41598-020-64870-z.
Predictions of distant cancer metastasis based on gene signatures are studied intensively to enable precise diagnosis and treatment. Gene selection, i.e. feature selection, is a cornerstone for both establishing accurate predictions and understanding the underlying pathologies. Here, we developed a simple but robust feature selection method that uses a correlation-centred approach to select minimal gene sets with both high predictive and generalisation abilities. A multiple logistic regression model was used to predict 5-year metastasis in patients with breast cancer. Gene expression data obtained from tumour samples of lymph node-negative breast cancer patients were randomly split into training and validation data. Using the training data, our method selected 12 genes, which yielded an area under the receiver operating characteristic curve of 0.730, higher than the 0.579 yielded by a previously reported set of 76 genes. The signature, together with the predictive model, was validated in an independent dataset, where it showed higher generalisation ability. Gene ontology analyses revealed that our method consistently selected genes whose functions were identical to those frequently represented among the 76 genes. Taken together, our method identifies smaller gene sets with high predictive ability, and it should be versatile enough to predict other factors, such as the outcomes of medical treatments and the prognoses of other cancer types.
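The abstract does not give the details of the selection procedure, so the following is only a minimal sketch of one common correlation-based filter, not the authors' method. It assumes genes are ranked by absolute Pearson correlation with the metastasis label and that a gene is skipped when it is strongly correlated with one already selected. The function names, the `redundancy_cutoff` value, and the toy expression values are all hypothetical. The area under the ROC curve is computed with the rank (Mann-Whitney) formulation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_genes(expr, outcome, max_genes=12, redundancy_cutoff=0.8):
    """Greedy correlation filter (an assumed procedure, for illustration only).

    expr: dict mapping gene name -> list of expression values, one per sample.
    outcome: list of 0/1 labels (1 = metastasis within 5 years).
    """
    # Rank genes by how strongly they correlate with the outcome.
    ranked = sorted(expr, key=lambda g: abs(pearson(expr[g], outcome)), reverse=True)
    selected = []
    for gene in ranked:
        # Keep a gene only if it is not redundant with any gene already kept.
        if all(abs(pearson(expr[gene], expr[s])) < redundancy_cutoff for s in selected):
            selected.append(gene)
        if len(selected) == max_genes:
            break
    return selected


def auc(scores, labels):
    """Area under the ROC curve: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration: 8 samples, 3 hypothetical genes.
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
expr = {
    "GENE_A": [1.0, 1.2, 0.9, 1.1, 2.0, 2.2, 1.9, 2.1],  # tracks the outcome closely
    "GENE_B": [1.0, 1.3, 0.8, 1.1, 2.0, 2.3, 1.8, 2.1],  # near-duplicate of GENE_A
    "GENE_C": [0.5, 1.5, 0.4, 1.6, 0.9, 1.9, 0.8, 1.8],  # weakly informative
}

selected = select_genes(expr, outcome, max_genes=2)
print(selected)
print(auc(expr["GENE_A"], outcome), auc(expr["GENE_C"], outcome))
```

In a fuller pipeline the selected genes would be the covariates of a multiple logistic regression fitted on the training split, and `auc` would be applied to the predicted probabilities on the validation split rather than to raw expression values.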