Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery.

Author information

Grate Leslie R

Affiliation

Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.

Publication information

BMC Bioinformatics. 2005 Apr 13;6:97. doi: 10.1186/1471-2105-6-97.

Abstract

BACKGROUND

Molecular profiling generates abundance measurements for thousands of gene transcripts in biological samples such as normal and tumor tissues (data points). Given such two-class, high-dimensional data, many methods have been proposed for classifying data points into one of the two classes. However, finding very small sets of features that correctly classify the data is problematic because the underlying mathematical problem is hard. Existing methods can find "small" feature sets, but give no indication of how close these are to the true minimum size. Without fundamental mathematical advances, finding true minimum-size sets will remain elusive and, more importantly for the microarray community, there will be no methods for finding them.

RESULTS

We use the brute-force approach of exhaustive search through all single genes, all gene pairs and, for some data sets, all gene triples. Each unique gene combination is analyzed with a few-parameter linear-hyperplane classification method, looking for combinations that form classifiers with no training error. All 10 published data sets studied are found to contain small predictive feature sets. Four contain thousands of gene pairs and 6 have single genes that perfectly discriminate.
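The search described above can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not name the hyperplane method, so a plain perceptron with an epoch cap stands in for it here, and all function names, the toy gene names (`G1`, `G2`, `G3`) and the toy values are hypothetical. Skipping combinations that contain an already-found smaller hit is also an assumption made to keep the output short. Note that a perceptron that fails to converge within the cap does not prove that a combination is inseparable.

```python
from itertools import combinations

def single_gene_separates(values, labels):
    """True if one threshold on this gene splits the two classes exactly.
    Assumes both classes (labels 0 and 1) are present."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return max(pos) < min(neg) or max(neg) < min(pos)

def perceptron_separates(points, labels, max_epochs=1000):
    """Try to find w, b with zero training error (stand-in classifier).
    False means 'no separator found within max_epochs', not a proof."""
    w = [0.0] * len(points[0])
    b = 0.0
    signs = [1 if y == 1 else -1 for y in labels]
    for _ in range(max_epochs):
        mistakes = 0
        for x, s in zip(points, signs):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if s * score <= 0:  # misclassified or on the boundary
                w = [wi + s * xi for wi, xi in zip(w, x)]
                b += s
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def exhaustive_search(data, labels, max_size=2):
    """data maps gene name -> expression values, one value per sample.
    Returns gene combinations that classify the training set without error."""
    genes = sorted(data)
    hits = []
    for size in range(1, max_size + 1):
        for combo in combinations(genes, size):
            # Assumption: skip supersets of smaller hits (trivially separable).
            if any(set(h) <= set(combo) for h in hits):
                continue
            if size == 1:
                ok = single_gene_separates(data[combo[0]], labels)
            else:
                points = list(zip(*(data[g] for g in combo)))
                ok = perceptron_separates(points, labels)
            if ok:
                hits.append(combo)
    return hits

# Hypothetical toy data: 6 samples, 3 genes, two classes.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_data = {
    "G1": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],  # separates on its own
    "G2": [1.0, 3.0, 2.0, 2.0, 4.0, 3.0],  # separates only together with G3
    "G3": [2.0, 4.0, 3.0, 1.0, 3.0, 2.0],
}
print(exhaustive_search(toy_data, toy_labels))
```

For n genes the loop visits n single genes, n(n-1)/2 pairs and n(n-1)(n-2)/6 triples, which is why the triple search is the expensive step and is run only for some data sets.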

CONCLUSION

This technique discovered small sets of genes (3 or fewer) in published data that form accurate classifiers, yet were not reported in the prior publications. This could be a common characteristic of microarray data, which would make searching for such sets worth the computational cost. Such small gene sets could indicate biomarkers and portend simple medical diagnostic tests. We recommend checking for small gene sets routinely. We find 4 gene pairs and many gene triples in the large hepatocellular carcinoma (HCC, liver cancer) data set of Chen et al. The key component of these is the "placental gene of unknown function", PLAC8. Our HMM modeling indicates that PLAC8 might have a domain like part of the crystal structure 1P59 (a non-covalent endonuclease III-DNA complex). The previously identified HCC biomarker gene, glypican 3 (GPC3), is part of an accurate gene triple involving MT1E and ARHE. We also find small gene sets that distinguish leukemia subtypes in the large pediatric acute lymphoblastic leukemia data set of Yeoh et al.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8768/1090559/0c25aceba07b/1471-2105-6-97-1.jpg
