Integrated Theory- and Data-driven Feature Selection in Gene Expression Data Analysis.

Authors

Raghu Vineet K, Ge Xiaoyu, Chrysanthis Panos K, Benos Panayiotis V

Affiliations

Department of Computer Science, University of Pittsburgh.

Department of Computational and Systems Biology, University of Pittsburgh.

Publication information

Proc Int Conf Data Eng. 2017 Apr;2017:1525-1532. doi: 10.1109/ICDE.2017.223. Epub 2017 May 18.

DOI:10.1109/ICDE.2017.223

PMID:29422764

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5799807/

Abstract

The exponential growth of high-dimensional biological data has led to a rapid increase in demand for automated approaches to knowledge production. Existing methods rely on two general approaches to address this challenge: 1) the theory-driven approach, which utilizes prior accumulated knowledge, and 2) the data-driven approach, which relies solely on the data to deduce scientific knowledge. Used alone, each approach is biased toward either past or present knowledge, because neither incorporates all of the knowledge currently available for making new discoveries. In this paper, we show how an integrated method can effectively address the high dimensionality of big biological data, which is a major problem for purely data-driven analysis approaches. We realize our approach in a novel two-step analytical workflow: the first step applies a new feature selection paradigm to high-throughput gene expression data, and the second step uses graphical causal modeling to automatically extract causal relationships. Our results on real-world clinical datasets from The Cancer Genome Atlas (TCGA) demonstrate that our method is capable of intelligently selecting genes for learning effective causal networks.
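The first step of the workflow described in the abstract, feature selection that draws on both prior knowledge and the data, can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a hypothetical per-gene prior relevance score in [0, 1] (for example, derived from a curated gene-disease database) and blends it linearly with the absolute Pearson correlation between each gene and the outcome. The function name `select_genes`, the weight `alpha`, and the linear blend are all illustrative assumptions. The second step, graphical causal modeling, is not sketched here.

```python
import numpy as np


def select_genes(expr, target, prior, k, alpha=0.5):
    """Rank genes by a blend of data-driven and prior-knowledge scores.

    expr   : (n_samples, n_genes) expression matrix
    target : (n_samples,) outcome variable
    prior  : (n_genes,) hypothetical prior relevance scores in [0, 1]
    k      : number of genes to keep
    alpha  : weight on the data-driven score (assumption, not from the paper);
             0 uses the prior only, 1 uses the data only
    Returns the indices of the k highest-scoring genes, best first.
    """
    expr = np.asarray(expr, dtype=float)
    target = np.asarray(target, dtype=float)
    prior = np.asarray(prior, dtype=float)

    # Center the data so the dot products below are covariances up to scale.
    expr_c = expr - expr.mean(axis=0)
    target_c = target - target.mean()

    # Absolute Pearson correlation of every gene with the target.
    # Constant genes have zero norm; their numerator is also zero,
    # so dividing by 1 instead of 0 leaves their correlation at 0.
    denom = np.linalg.norm(expr_c, axis=0) * np.linalg.norm(target_c)
    corr = np.abs(expr_c.T @ target_c) / np.where(denom > 0, denom, 1.0)

    # Linear blend of the data-driven and theory-driven evidence.
    score = alpha * corr + (1.0 - alpha) * prior

    # Stable sort keeps the original gene order among exact ties.
    return np.argsort(-score, kind="stable")[:k]


if __name__ == "__main__":
    target = np.arange(6.0)
    expr = np.column_stack([
        target,                                   # gene 0: tracks the outcome
        np.ones(6),                               # gene 1: constant, uninformative
        np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]),  # gene 2: weakly related
        -target,                                  # gene 3: anti-correlated
    ])
    prior = np.array([0.2, 1.0, 0.0, 0.9])        # made-up prior scores
    print(select_genes(expr, target, prior, k=2))
```

With `alpha=0.0` the ranking follows the prior alone, and with `alpha=1.0` it follows the data alone, which makes the bias of each pure approach easy to see on a toy example. The selected column indices would then be passed to whatever causal structure learning method is used in the second step.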
