Identifying genes that contribute most to good classification in microarrays.

Authors

Baker Stuart G, Kramer Barnett S

Affiliation

Biometry Research Group, Division of Cancer Prevention, National Cancer Institute, Bethesda, MD 20892-7354, USA.

Publication

BMC Bioinformatics. 2006 Sep 7;7:407. doi: 10.1186/1471-2105-7-407.

Abstract

BACKGROUND

The goal of most microarray studies is either the identification of genes that are most differentially expressed or the creation of a good classification rule. The disadvantage of the former is that it ignores the importance of gene interactions; the disadvantage of the latter is that it often does not provide a sufficient focus for further investigation because many genes may be included by chance. Our strategy is to search for classification rules that perform well with few genes and, if they are found, identify genes that occur relatively frequently under multiple random validation (random splits into training and test samples).
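The multiple random validation strategy described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the filter (ranking genes by absolute difference in class means), the stratified split, and every function and parameter name (`top_genes_by_mean_difference`, `stratified_train_indices`, `gene_selection_frequency`, `train_fraction`, `n_splits`) are hypothetical choices made for the example.

```python
import random
from collections import Counter


def top_genes_by_mean_difference(X, y, k):
    """Assumed filter: keep the k genes with the largest absolute
    difference in class means (labels are 1 and 0) on the training data."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]


def stratified_train_indices(y, train_fraction, rng):
    """Randomly pick a training subset within each class, so that every
    split contains samples from both classes."""
    train = []
    for label in sorted(set(y)):
        members = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(members)
        n_train = max(1, int(round(train_fraction * len(members))))
        train.extend(members[:n_train])
    return train


def gene_selection_frequency(X, y, k=5, n_splits=100, train_fraction=0.7, seed=0):
    """Fraction of random splits in which each gene enters the k-gene rule."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_splits):
        train = stratified_train_indices(y, train_fraction, rng)
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        counts.update(top_genes_by_mean_difference(X_train, y_train, k))
    return {gene: count / n_splits for gene, count in counts.items()}
```

Genes whose frequency stays high across splits when `k` is small are the candidates the strategy would flag for further study; with large `k`, many genes enter every rule and the frequencies become less informative.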

RESULTS

We analyzed data from four published studies related to cancer. For classification we used a filter with a nearest centroid rule that is easy to implement and has been previously shown to perform well. To comprehensively measure classification performance we used receiver operating characteristic curves. In the three data sets with good classification performance, the classification rules for 5 genes were only slightly worse than for 20 or 50 genes and somewhat better than for 1 gene. In two of these data sets, one or two genes had relatively high frequencies not noticeable with rules involving 20 or 50 genes: desmin for classifying colon cancer versus normal tissue; and zyxin and secretory granule proteoglycan genes for classifying two types of leukemia.
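To make the classification step concrete, here is a minimal sketch of a nearest centroid rule scored on test samples. The abstract does not say how the receiver operating characteristic curves were summarized, so the area under the curve used here is an assumption, chosen only because it is a common summary. Squared Euclidean distance and the names `centroid`, `nearest_centroid_score`, and `auc` are likewise illustrative, not taken from the paper.

```python
def centroid(rows):
    """Per-gene mean expression over a list of samples."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def nearest_centroid_score(x, c_pos, c_neg):
    """Continuous score: positive when x lies closer to the positive-class
    centroid (squared Euclidean distance is an assumed choice)."""
    d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
    return d_neg - d_pos


def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a random
    positive sample outscores a random negative one (ties count one half)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy usage: centroids come from training samples, scores from test samples.
c_pos = centroid([[2.0, 2.0], [3.0, 3.0]])
c_neg = centroid([[0.0, 0.0], [1.0, 1.0]])
test_X = [[3.0, 2.0], [0.0, 1.0], [2.0, 2.0], [1.0, 0.0]]
test_y = [1, 0, 1, 0]
test_scores = [nearest_centroid_score(x, c_pos, c_neg) for x in test_X]
test_auc = auc(test_scores, test_y)
```

Using a continuous score rather than a hard class label is what allows a full ROC curve to be traced by varying the decision threshold.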

CONCLUSION

Using multiple random validation, investigators should look for classification rules that perform well with few genes and select, for further study, genes with relatively high frequencies of occurrence in these classification rules.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/86da/1574352/48409f033baf/1471-2105-7-407-1.jpg