Instance-based concept learning from multiclass DNA microarray data.

Author information

Berrar Daniel, Bradbury Ian, Dubitzky Werner

Affiliation

School of Biomedical Sciences, University of Ulster at Coleraine, Cromore Road, Northern Ireland, UK.

Publication information

BMC Bioinformatics. 2006 Feb 16;7:73. doi: 10.1186/1471-2105-7-73.

DOI:10.1186/1471-2105-7-73

PMID:16483361

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC1402330/

Abstract

BACKGROUND

Various statistical and machine learning methods have been successfully applied to the classification of DNA microarray data. Simple instance-based classifiers such as nearest neighbor (NN) approaches perform remarkably well in comparison to more complex models, and are currently experiencing a renaissance in the analysis of data sets from biology and biotechnology. While binary classification of microarray data has been extensively investigated, studies involving multiclass data are rare. The question remains open whether there exists a significant difference in performance between NN approaches and more complex multiclass methods. Comparative studies in this field commonly assess different models based on their classification accuracy only; however, this approach lacks the rigor needed to draw reliable conclusions and is inadequate for testing the null hypothesis of equal performance. Comparing novel classification models to existing approaches requires focusing on the significance of differences in performance.
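To illustrate what testing the null hypothesis of equal performance can look like when accuracies come from repeated resampling, here is a minimal sketch of the variance-corrected resampled paired t-test (Nadeau and Bengio). The abstract does not name the test the authors used, so this choice is an assumption; the function name `corrected_resampled_t` and its parameters are hypothetical.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(acc_a, acc_b, n_train, n_test):
    """Return (t statistic, degrees of freedom) for two classifiers.

    acc_a, acc_b: accuracies of models A and B on the same k resampling runs.
    n_train, n_test: number of training and test samples in each run.
    Assumption: this is the Nadeau-Bengio corrected resampled t-test,
    not necessarily the test used in the paper.
    """
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    k = len(diffs)
    var_d = variance(diffs)  # sample variance of the paired differences
    if var_d == 0:
        raise ValueError("zero variance in differences; t is undefined")
    # The n_test / n_train term inflates the variance estimate to account
    # for the overlap between training sets across resampling runs.
    t = mean(diffs) / math.sqrt((1.0 / k + n_test / n_train) * var_d)
    return t, k - 1


if __name__ == "__main__":
    # Made-up accuracies, for illustration only.
    a = [0.90, 0.85, 0.88, 0.92, 0.87]
    b = [0.86, 0.84, 0.85, 0.88, 0.86]
    t, df = corrected_resampled_t(a, b, n_train=90, n_test=10)
    print(f"t = {t:.3f} with {df} degrees of freedom")
```

The statistic is compared against a Student t distribution with k - 1 degrees of freedom. Because of the correction term, it is smaller in magnitude than the ordinary paired t statistic, which makes the test more conservative.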

RESULTS

We investigated the performance of instance-based classifiers, including an NN classifier able to assign a degree of class membership to each sample. This model alleviates a major problem of conventional instance-based learners, namely the lack of confidence values for predictions. The model translates the distances to the nearest neighbors into 'confidence scores'; the higher the confidence score, the closer the considered instance is to a pre-defined class. We applied the models to three real gene expression data sets and compared them with state-of-the-art methods for classifying microarray data of multiple classes, assessing performance using a statistical significance test that took into account the data resampling strategy. Simple NN classifiers performed as well as, or significantly better than, their more intricate competitors.
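The idea of turning neighbor distances into per-class confidence scores can be sketched as follows. This is a minimal illustration, not the authors' model: the inverse-distance weighting, the normalization to a sum of 1, the Euclidean metric, and the names `knn_confidence` and `predict` are all assumptions made for the example.

```python
import math
from collections import defaultdict


def knn_confidence(train, query, k=3, eps=1e-9):
    """Return a dict mapping class label -> confidence score in [0, 1].

    train: list of (feature_vector, label) pairs.
    query: feature vector of the sample to classify.
    Assumption: each of the k nearest neighbors votes with weight
    1 / (distance + eps); scores are normalized to sum to 1.
    """
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training samples")
    # Euclidean distance from the query to every training instance.
    by_distance = sorted(
        ((math.dist(x, query), label) for x, label in train),
        key=lambda pair: pair[0],
    )
    weights = defaultdict(float)
    for distance, label in by_distance[:k]:
        weights[label] += 1.0 / (distance + eps)  # closer neighbors count more
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def predict(train, query, k=3):
    """Return (predicted label, confidence scores) for the query."""
    scores = knn_confidence(train, query, k)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy two-feature data with three classes, for illustration only.
    train = [
        ((0.0, 0.0), "A"), ((0.0, 1.0), "A"),
        ((5.0, 5.0), "B"), ((6.0, 5.0), "B"),
        ((10.0, 0.0), "C"),
    ]
    label, scores = predict(train, (0.5, 0.5), k=3)
    print(label, scores)
```

Classes with no instance among the k nearest neighbors receive no score in this sketch; a query lying close to several instances of one class gets a score near 1 for that class.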

CONCLUSION

Given its highly intuitive underlying principles--simplicity, ease-of-use, and robustness--the k-NN classifier complemented by a suitable distance-weighting regime constitutes an excellent alternative to more complex models for multiclass microarray data sets. Instance-based classifiers using weighted distances are not limited to microarray data sets, but are likely to perform competitively in classifications of high-dimensional biological data sets such as those generated by high-throughput mass spectrometry.
