Deb Kalyanmoy, Raji Reddy A
Kanpur Genetic Algorithms Laboratory (KanGAL), Indian Institute of Technology Kanpur, Kanpur 208 016, India.
Biosystems. 2003 Nov;72(1-2):111-29. doi: 10.1016/s0303-2647(03)00138-2.
In bioinformatics, identifying the gene subsets responsible for classifying available disease samples into two or more variants of a disease is an important task. Such problems have been solved in the past by unsupervised learning methods (hierarchical clustering, self-organizing maps, k-means clustering, etc.) and supervised learning methods (weighted voting, k-nearest neighbor, support vector machines, etc.). They can also be posed as optimization problems: minimize the gene subset size while achieving reliable and accurate classification. The main difficulties in solving the resulting optimization problem are that only a few samples are available compared with the number of genes per sample, and that the search space of solutions is exorbitantly large. Although a few applications of evolutionary algorithms (EAs) to this task exist, here we treat it as a multiobjective optimization problem of minimizing the gene subset size and minimizing the number of misclassified samples. Moreover, for more reliable classification, we consider multiple training sets when evaluating a classifier. In contrast to past studies, the use of a multiobjective EA (NSGA-II) has enabled us to discover smaller gene subsets (such as four or five genes) that correctly classify 100% or nearly 100% of the samples for three cancer data sets (Leukemia, Lymphoma, and Colon). We have also extended NSGA-II to obtain multiple non-dominated solutions, discovering as many as 352 different three-gene combinations that provide 100% correct classification of the Leukemia data. To gain further confidence in the identification task, we have also introduced a prediction strength threshold for deciding whether a sample belongs to one class or the other. All simulation results show consistent gene subset identification on the three disease data sets and demonstrate the flexibility and efficacy of a multiobjective EA for the gene subset identification task.
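The two-objective formulation in the abstract (gene subset size versus number of misclassified samples) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy expression values, the nearest-centroid classifier standing in for whatever classifier the paper uses, the single train/test split, and the exhaustive enumeration of subsets, which replaces the NSGA-II search and is only feasible for a handful of genes. The function names `misclassified`, `dominates`, and `non_dominated` are hypothetical.

```python
from itertools import combinations

def misclassified(genes, train, test):
    """Count test samples misclassified by a nearest-centroid rule
    restricted to the gene indices in `genes` (illustrative classifier)."""
    centroids = {}
    for label in {lab for _, lab in train}:
        rows = [x for x, lab in train if lab == label]
        centroids[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    errors = 0
    for x, label in test:
        def dist(c):
            return sum((x[g] - c[i]) ** 2 for i, g in enumerate(genes))
        # sorted() makes tie-breaking between classes deterministic
        predicted = min(sorted(centroids), key=lambda lab: dist(centroids[lab]))
        errors += predicted != label
    return errors

def dominates(a, b):
    """a dominates b if a is no worse in every objective and differs from b
    (both objectives are minimized)."""
    return all(p <= q for p, q in zip(a, b)) and a != b

def non_dominated(points):
    """Keep the subsets whose objective vectors no other subset dominates."""
    return {s: f for s, f in points.items()
            if not any(dominates(g, f) for g in points.values())}

# Toy data (assumed): 4 genes, 2 classes; only gene 0 separates the classes.
train = [
    ([1.0, 5.0, 3.0, 2.0], "A"),
    ([1.2, 1.0, 3.1, 2.5], "A"),
    ([4.0, 5.2, 3.0, 2.1], "B"),
    ([4.1, 0.9, 2.9, 2.4], "B"),
]
test = [
    ([0.9, 3.0, 3.0, 2.0], "A"),
    ([4.2, 3.0, 3.0, 2.0], "B"),
]

n_genes = 4
points = {}
for size in range(1, n_genes + 1):
    for subset in combinations(range(n_genes), size):
        # Objective vector: (subset size, number of misclassified samples)
        points[subset] = (size, misclassified(subset, train, test))

front = non_dominated(points)
print(front)
```

On this toy data the single gene that separates the two classes is the only non-dominated subset. With thousands of genes the number of subsets is far too large to enumerate, which is where a population-based search such as NSGA-II comes in; averaging or summing the error count over several training sets, as the abstract suggests, would be a natural extension of the second objective.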