Iterative class discovery and feature selection using Minimal Spanning Trees.

Authors

Varma Sudhir, Simon Richard

Affiliation

Biometric Research Branch, National Cancer Institute, Rockville, USA.

Publication

BMC Bioinformatics. 2004 Sep 8;5:126. doi: 10.1186/1471-2105-5-126.

DOI:10.1186/1471-2105-5-126

PMID:15355552

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC520744/

Abstract

BACKGROUND

Clustering is one of the most commonly used methods for discovering hidden structure in microarray gene expression data. Most current methods for clustering samples are based on distance metrics that use all genes. This can obscure sample clusters that are evident only in a subset of genes, because noise from irrelevant genes dominates the signal from the relevant genes in the distance calculation.

RESULTS

We describe an algorithm for automatically detecting clusters of samples that are discernible only in a subset of genes. We iterate between Minimal Spanning Tree based clustering and feature selection to remove noise genes in a step-wise manner while simultaneously sharpening the clustering. Evaluation of this algorithm on synthetic data shows that it resolves planted clusters with high accuracy in spite of noise and the presence of other clusters. It also shows a low probability of detecting spurious clusters. Testing the algorithm on some well-known microarray data sets reveals known biological classes as well as novel clusters.
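The loop described above, alternating between a spanning-tree split of the samples and a pruning of the genes, can be illustrated with a minimal sketch. The abstract does not say how the tree is cut or how genes are scored, so everything specific here is an assumption made for illustration: the split removes the longest MST edge that leaves at least two samples on each side, genes are ranked by an absolute Welch-style t statistic, and the names `iterative_mst_clustering`, `keep_fraction`, `min_genes`, and `min_size` are hypothetical. It is not the authors' MATLAB implementation.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mst_edges(points):
    """Prim's algorithm; returns the MST as (length, i, j) tuples."""
    n = len(points)
    # best[j] = (shortest known distance from the tree to j, tree node)
    best = {j: (euclid(points[0], points[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        d, i = best.pop(j)
        edges.append((d, i, j))
        for k in best:
            dk = euclid(points[j], points[k])
            if dk < best[k][0]:
                best[k] = (dk, j)
    return edges


def split_on_longest_edge(points, min_size=2):
    """Assumed split rule: cut the longest MST edge that leaves at least
    min_size samples on each side. Sample 0 always gets label 0."""
    n = len(points)
    edges = mst_edges(points)
    for cut in sorted(edges, reverse=True):
        adj = {i: [] for i in range(n)}
        for e in edges:
            if e is not cut:
                adj[e[1]].append(e[2])
                adj[e[2]].append(e[1])
        # Collect the component that contains one endpoint of the cut edge.
        seen, stack = {cut[1]}, [cut[1]]
        while stack:
            for nb in adj[stack.pop()]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if min_size <= len(seen) <= n - min_size:
            side_of_first = 0 in seen
            return [0 if (i in seen) == side_of_first else 1 for i in range(n)]
    return None


def gene_score(values, labels):
    """Assumed score: absolute Welch-style t statistic between the groups."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b) + 1e-12)


def iterative_mst_clustering(data, keep_fraction=0.5, min_genes=2, max_iter=10):
    """data: list of samples, each a list of gene values.
    Returns (labels, indices of the genes kept)."""
    genes = list(range(len(data[0])))
    labels = None
    for _ in range(max_iter):
        sub = [[row[g] for g in genes] for row in data]
        new_labels = split_on_longest_edge(sub)
        if new_labels is None:
            break
        converged = new_labels == labels
        labels = new_labels
        if converged or len(genes) <= min_genes:
            break
        # Keep only the genes that best separate the current partition.
        ranked = sorted(
            genes,
            key=lambda g: gene_score([row[g] for row in data], labels),
            reverse=True,
        )
        n_keep = max(min_genes, int(len(genes) * keep_fraction))
        genes = sorted(ranked[:n_keep])
    return labels, genes
```

Calling `iterative_mst_clustering(data)` on a samples-by-genes list of lists returns a binary labelling of the samples and the indices of the genes that survived pruning. Stopping as soon as the partition repeats is a simplification; a fuller treatment would also need a way to judge whether the final partition is spurious, which this sketch does not attempt.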

CONCLUSIONS

The iterative clustering method offers considerable improvement over clustering in all genes. This method can be used to discover partitions, and their biological significance can be determined by comparing them with clinical correlates and gene annotations. The MATLAB programs for the iterative clustering algorithm are available from http://linus.nci.nih.gov/supplement.html
