Wang Wenhui, Xie Guanglei, Ren Zhonglu, Xie Tingyan, Li Jinming
Network Information Center, The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, China.
National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou, China.
Curr Mol Med. 2020;20(6):415-428. doi: 10.2174/1566524019666191119105209.
Colorectal cancer (CRC) is the third most common cancer worldwide. Cancer discrimination is a typical application of gene expression analysis using microarray technology. However, microarray data suffer from the curse of dimensionality and typically show an imbalanced class distribution between the majority class (tumor samples) and the minority class (normal samples). Feature gene selection is therefore necessary for cancer discrimination.
To select feature genes for the discrimination of CRC.
We improve DEFSw, a feature selection algorithm based on differential evolution, for selecting feature genes from imbalanced data: the RUSBoost classifier replaces the common classifier, and weighted accuracy replaces the common evaluation measure. We first extract differentially expressed genes (DEGs) from the CRC dataset of the TCGA and then select feature genes from the DEGs using the improved DEFSw algorithm. Finally, we validate the selected feature gene sets on independent datasets and retrieve cancer-related information for these genes by text mining through the Coremine Medical online database.
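To illustrate why a class-aware evaluation measure matters on imbalanced data, the following minimal Python sketch compares plain accuracy with one possible form of weighted accuracy. The abstract does not define the measure, so the formula used here (a weighted sum of sensitivity and specificity), the weight `w`, and all function names are assumptions made for illustration only, not the authors' implementation.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def plain_accuracy(y_true, y_pred):
    """Fraction of samples classified correctly, ignoring class sizes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_accuracy(y_true, y_pred, w=0.5, positive=1):
    """Assumed definition: w * sensitivity + (1 - w) * specificity.

    With w = 0.5 this equals balanced accuracy; the weight actually
    used in the paper is not stated in the abstract.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return w * sensitivity + (1 - w) * specificity


# Hypothetical imbalanced data: 90 tumor samples (1), 10 normal samples (0),
# and a degenerate classifier that labels every sample as tumor.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100
```

On this hypothetical data, `plain_accuracy(y_true, y_pred)` is 0.9 even though no normal sample is recognized, while `weighted_accuracy(y_true, y_pred)` is 0.5, which exposes the failure on the minority class.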
We select 16 single-gene feature sets for colorectal cancer discrimination and 19 single-gene feature sets for colon cancer discrimination only.
In summary, we find a series of high-potential candidate biomarkers or signatures that can discriminate colon cancer, rectal cancer, or both with high sensitivity and specificity.