IEEE/ACM Trans Comput Biol Bioinform. 2019 Nov-Dec;16(6):1970-1985. doi: 10.1109/TCBB.2018.2837095. Epub 2018 May 16.
The goal of the human genome project is to integrate genetic information into different clinical therapies. To achieve this goal, different computational algorithms have been devised for identifying biomarker genes, the causes of complex diseases. However, most of the methods developed so far using DNA microarray data fall short in interpreting biological findings and are less accurate in disease prediction. In this paper, we propose two parameters, risk_factor and confusion_factor, to identify the biologically significant genes for cancer development. First, we evaluate the risk_factor of each gene; genes with nonzero risk_factor result in misclassification of data and are therefore removed. Next, we calculate the confusion_factor of the remaining genes, which measures the confusion of a gene in prediction due to the closeness of the samples in the cancer and normal classes. We apply the nondominated sorting genetic algorithm (NSGA-II) to select the maximally uncorrelated differentially expressed genes in the cancer class with minimum confusion_factor. The proposed Gene Selection Explore (GSE) algorithm is compared to well-established feature selection algorithms using 10 microarray datasets with respect to sensitivity, specificity, and accuracy. The identified genes appear in KEGG pathways and are biologically important in several respects.
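The nondominated sorting at the core of NSGA-II can be illustrated with a minimal sketch. It is not the GSE implementation: the abstract does not define how confusion_factor or gene correlation are computed, so the two objective values below are hypothetical numbers, and the function names `dominates` and `first_front` are chosen only for illustration. Both objectives are assumed to be minimized.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b on every objective and
    strictly better on at least one (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def first_front(points: List[Tuple[float, ...]]) -> List[int]:
    """Return the indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical candidate gene subsets, each scored as
# (confusion score, mean absolute correlation); values are made up.
candidates = [
    (0.10, 0.80),
    (0.20, 0.30),
    (0.40, 0.20),
    (0.30, 0.50),
    (0.15, 0.90),
]

front = first_front(candidates)
print(front)
```

Candidates on the first front are those for which no alternative is at least as good on both objectives and better on one. A full NSGA-II run would additionally rank the remaining fronts, apply crowding-distance selection, and evolve the population with crossover and mutation.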