Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery.

Author information

Grate Leslie R

Affiliation

Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.

Publication information

BMC Bioinformatics. 2005 Apr 13;6:97. doi: 10.1186/1471-2105-6-97.

DOI:10.1186/1471-2105-6-97

PMID:15826317

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1090559/

Abstract

BACKGROUND

Molecular profiling generates abundance measurements for thousands of gene transcripts in biological samples such as normal and tumor tissues (data points). Given such two-class high-dimensional data, many methods have been proposed for classifying data points into one of the two classes. However, finding very small sets of features able to correctly classify the data is problematic as the fundamental mathematical proposition is hard. Existing methods can find "small" feature sets, but give no hint how close this is to the true minimum size. Without fundamental mathematical advances, finding true minimum-size sets will remain elusive, and more importantly for the microarray community there will be no methods for finding them.

RESULTS

We use the brute force approach of exhaustive search through all genes, gene pairs (and for some data sets gene triples). Each unique gene combination is analyzed with a few-parameter linear-hyperplane classification method looking for those combinations that form training error-free classifiers. All 10 published data sets studied are found to contain predictive small feature sets. Four contain thousands of gene pairs and 6 have single genes that perfectly discriminate.
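The exhaustive search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a plain perceptron with an epoch cap stands in for the "few-parameter linear-hyperplane classification method" (an assumption, since the abstract does not name the method), and the function names, gene names, and toy expression values are all hypothetical. Because the epoch cap is treated as "not separable", this sketch can miss subsets that are separable only with a very small margin.

```python
from itertools import combinations


def perceptron_separates(points, labels, max_epochs=1000):
    """Return True if a perceptron reaches zero training errors.

    points: list of feature tuples; labels: list of +1 / -1.
    Reaching max_epochs is treated as "not separable" (an assumption
    of this sketch, not an exact separability test).
    """
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:
            return True
    return False


def error_free_subsets(expr, labels, max_size=2, max_epochs=1000):
    """Test every gene subset of size 1..max_size exhaustively.

    expr: dict mapping gene name -> list of expression values,
    one value per sample, in the same order as labels.
    Returns the gene tuples that give a training-error-free classifier.
    """
    genes = sorted(expr)
    hits = []
    for size in range(1, max_size + 1):
        for combo in combinations(genes, size):
            # One point per sample, restricted to the genes in combo.
            points = list(zip(*(expr[g] for g in combo)))
            if perceptron_separates(points, labels, max_epochs):
                hits.append(combo)
    return hits


# Toy data invented for illustration: 4 samples, 3 hypothetical genes.
labels = [1, 1, -1, -1]
expr = {
    "geneA": [5.0, 6.0, 1.0, 2.0],  # separates the classes on its own
    "geneB": [1.0, 3.0, 2.0, 4.0],  # separates only when paired
    "geneC": [3.0, 5.0, 1.0, 3.0],  # separates only when paired
}
print(error_free_subsets(expr, labels))
```

In this toy example, geneB and geneC each fail alone but succeed as a pair, which is the kind of combination a single-gene ranking would not surface. The number of subsets grows combinatorially with the number of genes, which is why the abstract restricts triples to some data sets.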

CONCLUSION

This technique discovered small sets of genes (3 or fewer) in published data that form accurate classifiers, yet were not reported in the prior publications. This could be a common characteristic of microarray data, thus making looking for them worth the computational cost. Such small gene sets could indicate biomarkers and portend simple medical diagnostic tests. We recommend checking for small gene sets routinely. We find 4 gene pairs and many gene triples in the large hepatocellular carcinoma (HCC, liver cancer) data set of Chen et al. The key component of these is the "placental gene of unknown function", PLAC8. Our HMM modeling indicates PLAC8 might have a domain like part of 1P59's crystal structure (a non-covalent endonuclease III-DNA complex). The previously identified HCC biomarker gene, glypican 3 (GPC3), is part of an accurate gene triple involving MT1E and ARHE. We also find small gene sets that distinguish leukemia subtypes in the large pediatric acute lymphoblastic leukemia cancer set of Yeoh et al.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8768/1090559/0c25aceba07b/1471-2105-6-97-1.jpg

Similar articles

Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery.

BMC Bioinformatics. 2005 Apr 13;6:97. doi: 10.1186/1471-2105-6-97.

In silico microdissection of microarray data from heterogeneous cell populations.

BMC Bioinformatics. 2005 Mar 14;6:54. doi: 10.1186/1471-2105-6-54.

Gene expression analysis in clear cell renal cell carcinoma using gene set enrichment analysis for biostatistical management.

BJU Int. 2011 Jul;108(2 Pt 2):E29-35. doi: 10.1111/j.1464-410X.2010.09794.x. Epub 2011 Mar 16.

Regularized Least Squares Cancer classifiers from DNA microarray data.

BMC Bioinformatics. 2005 Dec 1;6 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-6-S4-S2.

Tumor classification ranking from microarray data.

BMC Genomics. 2008 Sep 16;9 Suppl 2(Suppl 2):S21. doi: 10.1186/1471-2164-9-S2-S21.

Accurate cancer classification using expressions of very few genes.

IEEE/ACM Trans Comput Biol Bioinform. 2007 Jan-Mar;4(1):40-53. doi: 10.1109/TCBB.2007.1006.

Meta-analysis of cancer gene-profiling data.

Methods Mol Biol. 2010;576:409-26. doi: 10.1007/978-1-59745-545-9_21.

What should be expected from feature selection in small-sample settings.

Bioinformatics. 2006 Oct 1;22(19):2430-6. doi: 10.1093/bioinformatics/btl407. Epub 2006 Jul 26.

A stable iterative method for refining discriminative gene clusters.

BMC Genomics. 2008 Sep 16;9 Suppl 2(Suppl 2):S18. doi: 10.1186/1471-2164-9-S2-S18.

Classification of microarray data with factor mixture models.

Bioinformatics. 2006 Jan 15;22(2):202-8. doi: 10.1093/bioinformatics/bti779. Epub 2005 Nov 15.

Cited by

Multifaced roles of PLAC8 in cancer.

Biomark Res. 2021 Oct 9;9(1):73. doi: 10.1186/s40364-021-00329-1.

Coagulation FXIII-A Protein Expression Defines Three Novel Sub-populations in Pediatric B-Cell Progenitor Acute Lymphoblastic Leukemia Characterized by Distinct Gene Expression Signatures.

Front Oncol. 2019 Oct 25;9:1063. doi: 10.3389/fonc.2019.01063. eCollection 2019.

The novel KLF4/PLAC8 signaling pathway regulates lung cancer growth.

Cell Death Dis. 2018 May 22;9(6):603. doi: 10.1038/s41419-018-0580-3.

Multiclass cancer classification based on gene expression comparison.

Stat Appl Genet Mol Biol. 2014 Aug;13(4):477-96. doi: 10.1515/sagmb-2013-0053.

An integrated approach for identifying wrongly labelled samples when performing classification in microarray data.

PLoS One. 2012;7(10):e46700. doi: 10.1371/journal.pone.0046700. Epub 2012 Oct 17.

Highly sensitive molecular diagnosis of prostate cancer using surplus material washed off from biopsy needles.

Br J Cancer. 2011 Nov 8;105(10):1600-7. doi: 10.1038/bjc.2011.435. Epub 2011 Oct 18.

High-dimensional bolstered error estimation.

Bioinformatics. 2011 Nov 1;27(21):3056-64. doi: 10.1093/bioinformatics/btr518. Epub 2011 Sep 13.

Learning biomarkers of pluripotent stem cells in mouse.

DNA Res. 2011 Aug;18(4):233-51. doi: 10.1093/dnares/dsr016. Epub 2011 Jul 26.

Rough set soft computing cancer classification and network: one stone, two birds.

Cancer Inform. 2010 Jul 15;9:139-45. doi: 10.4137/cin.s4874.

Analysis and computational dissection of molecular signature multiplicity.

PLoS Comput Biol. 2010 May 20;6(5):e1000790. doi: 10.1371/journal.pcbi.1000790.

References

Robust sparse hyperplane classifiers: application to uncertain molecular profiling data.

J Comput Biol. 2004;11(6):1073-89. doi: 10.1089/cmb.2004.11.1073.

Clinical proteomics: written in blood.

Nature. 2003 Oct 30;425(6961):905. doi: 10.1038/425905a.

Gene expression profile in multiple sclerosis patients and healthy controls: identifying pathways relevant to disease.

Hum Mol Genet. 2003 Sep 1;12(17):2191-9. doi: 10.1093/hmg/ddg221. Epub 2003 Jul 8.

Identification of signature genes by microarray for acute myeloid leukemia without maturation and acute promyelocytic leukemia with t(15;17)(q22;q12)(PML/RARalpha).

Int J Oncol. 2003 Sep;23(3):617-25.

Glypican-3 is overexpressed in human hepatocellular carcinoma.

Cancer Sci. 2003 Mar;94(3):259-62. doi: 10.1111/j.1349-7006.2003.tb01430.x.

Microarray reveals differences in both tumors and vascular specific gene expression in de novo CD5+ and CD5- diffuse large B-cell lymphomas.

Cancer Res. 2003 Jan 1;63(1):60-6.

Identification of combination gene sets for glioma classification.

Mol Cancer Ther. 2002 Nov;1(13):1229-36.

Gene-expression profiles predict survival of patients with lung adenocarcinoma.

Nat Med. 2002 Aug;8(8):816-24. doi: 10.1038/nm733. Epub 2002 Jul 15.

Gene expression profiles of BRCA1-linked, BRCA2-linked, and sporadic ovarian cancers.

J Natl Cancer Inst. 2002 Jul 3;94(13):990-1000. doi: 10.1093/jnci/94.13.990.

Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling.

Cancer Cell. 2002 Mar;1(2):133-43. doi: 10.1016/s1535-6108(02)00032-6.
