Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data.

Authors

Ooi Chia Huey, Chetty Madhu, Teng Shyh Wei

Affiliation

Gippsland School of Information Technology, Monash University, Churchill, VIC 3842, Australia.

Publication details

BMC Bioinformatics. 2006 Jun 23;7:320. doi: 10.1186/1471-2105-7-320.

DOI:10.1186/1471-2105-7-320

PMID:16796748

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1569877/

Abstract

BACKGROUND

Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy.

RESULTS

We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets.

CONCLUSION

For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
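
The role of the DDP as a balance between relevance and redundancy can be illustrated with a small sketch. The abstract does not state the scoring formula, so everything below is an assumption made for illustration: relevance is taken as a per-gene ratio of between-class to within-class sum of squares, redundancy is measured through absolute Pearson correlation between genes, and the two are combined as `relevance ** alpha * antiredundancy ** (1 - alpha)` inside a greedy forward search. The names `relevance_scores`, `select_predictors`, and `alpha` are hypothetical and are not taken from the paper.

```python
import numpy as np


def relevance_scores(X, y):
    """Per-gene ratio of between-class to within-class sum of squares (assumed relevance measure)."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        bss += len(Xc) * (mu - overall) ** 2
        wss += ((Xc - mu) ** 2).sum(axis=0)
    return bss / (wss + 1e-12)  # small constant avoids division by zero


def select_predictors(X, y, size, alpha):
    """Greedy forward selection; alpha in [0, 1] plays the role of the DDP.

    alpha = 1 optimizes relevance only, alpha = 0 optimizes antiredundancy only.
    Assumes no gene is constant across samples (otherwise correlations are undefined).
    """
    rel = relevance_scores(X, y)
    corr = np.clip(np.abs(np.corrcoef(X, rowvar=False)), 0.0, 1.0)
    selected = [int(np.argmax(rel))]  # seed with the most relevant gene
    while len(selected) < size:
        best_gene, best_score = None, -np.inf
        for g in range(X.shape[1]):
            if g in selected:
                continue
            cand = selected + [g]
            v = rel[cand].mean()  # mean relevance of the candidate set
            pairs = corr[np.ix_(cand, cand)][np.triu_indices(len(cand), k=1)]
            u = (1.0 - pairs).mean()  # mean pairwise antiredundancy
            score = v ** alpha * u ** (1.0 - alpha)
            if score > best_score:
                best_gene, best_score = g, score
        selected.append(best_gene)
    return selected


# Toy data: genes 0 and 1 are near-duplicates, gene 2 carries complementary information.
X_demo = np.array([
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 1.0],
    [5.0, 5.1, 5.0],
    [5.1, 5.0, 6.0],
    [10.0, 10.1, 0.0],
    [10.1, 10.0, 1.0],
])
y_demo = np.array([0, 0, 1, 1, 2, 2])
print(select_predictors(X_demo, y_demo, size=2, alpha=1.0))
print(select_predictors(X_demo, y_demo, size=2, alpha=0.5))
```

On this toy data, setting `alpha` to 1 keeps the two near-duplicate genes because only relevance counts, whereas a lower `alpha` swaps one of them for the less relevant but uncorrelated gene. In practice, a value of `alpha` would be tuned per dataset, and the selection would be repeated inside each training split so that accuracy estimates are not optimistically biased.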
