Feature selection for gene expression using model-based entropy.

Affiliation

NEC Laboratories America, 10080 North Wolfe Road, Cupertino, CA 95014, USA.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2010 Jan-Mar;7(1):25-36. doi: 10.1109/TCBB.2008.35.

DOI:10.1109/TCBB.2008.35
PMID:20150666
Abstract

Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims to find a set of genes that best discriminates between biological samples of different types. Among machine learning techniques, traditional gene selection based on empirical mutual information suffers from data sparseness because of the small number of samples. To overcome the sparseness issue, we propose a model-based approach that estimates the entropy of class variables from a fitted model rather than from the data themselves. Here, we use multivariate normal distributions to fit the data, because multivariate normal distributions have maximum entropy among all real-valued distributions with a specified mean and standard deviation and are widely used to approximate various distributions. If the data follow a multivariate normal distribution, the conditional distribution of the class variables given the selected features is also a normal distribution, so its entropy can be computed from the log-determinant of its covariance matrix. Because of the large number of genes, computing all possible log-determinants is inefficient. We propose several algorithms that substantially reduce the computational cost. Experiments on seven gene data sets and a comparison with five other approaches show the accuracy of the multivariate Gaussian generative model for feature selection and the efficiency of our algorithms.
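As a rough illustration of the log-determinant idea described in the abstract, the following Python sketch computes the Gaussian conditional entropy of a class variable given a set of features, and uses it in a plain greedy forward search. It is not the authors' algorithm: the function names `conditional_entropy` and `greedy_select`, the numeric encoding of the class label as a single column, the `ridge` term added to keep the covariance well conditioned with few samples, and the greedy search itself are all assumptions made for this example. The paper's own cost-reduction algorithms are not reproduced here.

```python
import numpy as np


def conditional_entropy(cov, class_idx, feat_idx):
    """Differential entropy H(C | X_S) in nats, assuming a joint Gaussian.

    Uses log det(Sigma_{C|S}) = log det(Sigma_{C u S}) - log det(Sigma_{SS}),
    i.e. the log-determinant of the Schur complement.
    """
    c = np.asarray(class_idx, dtype=int)
    s = np.asarray(feat_idx, dtype=int)
    joint = np.concatenate([c, s])
    _, logdet_joint = np.linalg.slogdet(cov[np.ix_(joint, joint)])
    if s.size == 0:
        logdet_feat = 0.0  # no conditioning: plain entropy of the class block
    else:
        _, logdet_feat = np.linalg.slogdet(cov[np.ix_(s, s)])
    k = c.size
    return 0.5 * (k * np.log(2.0 * np.pi * np.e) + logdet_joint - logdet_feat)


def greedy_select(X, y, n_select, ridge=1e-3):
    """Greedy forward selection (an assumption, not the paper's search).

    X: (samples, genes) array; y: numeric class labels, one per sample.
    Returns the column indices of X chosen in selection order.
    """
    data = np.column_stack([y, X])  # column 0 holds the class variable
    # Hypothetical ridge term: stabilises the estimate when samples << genes.
    cov = np.cov(data, rowvar=False) + ridge * np.eye(data.shape[1])
    selected = []
    remaining = list(range(1, data.shape[1]))
    for _ in range(n_select):
        # Pick the gene whose addition leaves the least class uncertainty.
        best = min(
            remaining,
            key=lambda j: conditional_entropy(cov, [0], selected + [j]),
        )
        selected.append(best)
        remaining.remove(best)
    return [j - 1 for j in selected]  # shift back to column indices of X
```

Each greedy step in this sketch recomputes log-determinants from scratch, which is exactly the cost the abstract describes as inefficient when the number of genes is large.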


Similar articles

1
Feature selection for gene expression using model-based entropy.
IEEE/ACM Trans Comput Biol Bioinform. 2010 Jan-Mar;7(1):25-36. doi: 10.1109/TCBB.2008.35.
2
Optimal number of features as a function of sample size for various classification rules.
Bioinformatics. 2005 Apr 15;21(8):1509-15. doi: 10.1093/bioinformatics/bti171. Epub 2004 Nov 30.
3
Fuzzy-rough sets for information measures and selection of relevant genes from microarray data.
IEEE Trans Syst Man Cybern B Cybern. 2010 Jun;40(3):741-52. doi: 10.1109/TSMCB.2009.2028433. Epub 2009 Nov 3.
4
What should be expected from feature selection in small-sample settings.
Bioinformatics. 2006 Oct 1;22(19):2430-6. doi: 10.1093/bioinformatics/btl407. Epub 2006 Jul 26.
5
Fast calculation of pairwise mutual information for gene regulatory network reconstruction.
Comput Methods Programs Biomed. 2009 May;94(2):177-80. doi: 10.1016/j.cmpb.2008.11.003. Epub 2009 Jan 22.
6
A blocking strategy to improve gene selection for classification of gene expression data.
IEEE/ACM Trans Comput Biol Bioinform. 2007 Apr-Jun;4(2):293-300. doi: 10.1109/TCBB.2007.1014.
7
Combining sequence and time series expression data to learn transcriptional modules.
IEEE/ACM Trans Comput Biol Bioinform. 2005 Jul-Sep;2(3):194-202. doi: 10.1109/TCBB.2005.34.
8
Genetic test bed for feature selection.
Bioinformatics. 2006 Apr 1;22(7):837-42. doi: 10.1093/bioinformatics/btl008. Epub 2006 Jan 20.
9
A stable iterative method for refining discriminative gene clusters.
BMC Genomics. 2008 Sep 16;9 Suppl 2(Suppl 2):S18. doi: 10.1186/1471-2164-9-S2-S18.
10
Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach.
Bioinformatics. 2007 Jul 1;23(13):1607-15. doi: 10.1093/bioinformatics/btm158. Epub 2007 May 5.

Cited by

1
Cancer Categorization Using Genetic Algorithm to Identify Biomarker Genes.
J Healthc Eng. 2022 Feb 22;2022:5821938. doi: 10.1155/2022/5821938. eCollection 2022.
2
Estimating Differential Entropy using Recursive Copula Splitting.
Entropy (Basel). 2020 Feb 19;22(2):236. doi: 10.3390/e22020236.
3
Multiplatform biomarker identification using a data-driven approach enables single-sample classification.
BMC Bioinformatics. 2019 Nov 21;20(1):601. doi: 10.1186/s12859-019-3140-7.
4
Intra- and Inter-individual Variability of microRNA Levels in Human Cerebrospinal Fluid: Critical Implications for Biomarker Discovery.
Sci Rep. 2017 Oct 5;7(1):12720. doi: 10.1038/s41598-017-13031-w.
5
AVC: Selecting discriminative features on basis of AUC by maximizing variable complementarity.
BMC Bioinformatics. 2017 Mar 14;18(Suppl 3):50. doi: 10.1186/s12859-017-1468-4.
6
CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests.
BMC Bioinformatics. 2017 Mar 14;18(1):169. doi: 10.1186/s12859-017-1578-z.
7
Informative gene selection and the direct classification of tumors based on relative simplicity.
BMC Bioinformatics. 2016 Jan 20;17:44. doi: 10.1186/s12859-016-0893-0.
8
Binary matrix shuffling filter for feature selection in neuronal morphology classification.
Comput Math Methods Med. 2015;2015:626975. doi: 10.1155/2015/626975. Epub 2015 Mar 29.
9
Informative gene selection and direct classification of tumor based on Chi-square test of pairwise gene interactions.
Biomed Res Int. 2014;2014:589290. doi: 10.1155/2014/589290. Epub 2014 Jul 23.
10
iPcc: a novel feature extraction method for accurate disease class discovery and prediction.
Nucleic Acids Res. 2013 Aug;41(14):e143. doi: 10.1093/nar/gkt343. Epub 2013 Jun 12.