Stable feature selection utilizing Graph Convolutional Neural Network and Layer-wise Relevance Propagation for biomarker discovery in breast cancer.

Affiliations

Medical Bioinformatics, University Medical Center Göttingen, Goldschmidtstraße 1, Göttingen, 37077, Germany.

Medical Bioinformatics, University Medical Center Göttingen, Goldschmidtstraße 1, Göttingen, 37077, Germany; Medical Statistics, University Medical Center Göttingen, Humboldtallee 32, Göttingen, 37073, Germany; Scientific Core Facility Medical Biometry and Statistical Bioinformatics, University Medical Center Göttingen, Humboldtallee 32, Göttingen, 37073, Germany.

Publication information

Artif Intell Med. 2024 May;151:102840. doi: 10.1016/j.artmed.2024.102840. Epub 2024 Mar 11.

Abstract

High-throughput technologies are becoming increasingly important in discovering prognostic biomarkers and in identifying novel drug targets. With Mammaprint, Oncotype DX, and many other prognostic molecular signatures, breast cancer is a paradigmatic example of the utility of high-throughput data for delivering prognostic biomarkers that can be represented in the form of a rather short gene list. Such gene lists can be obtained as the set of features (genes) that are important for the decisions of a Machine Learning (ML) method applied to high-dimensional gene expression data. Several studies have identified predictive gene lists for patient prognosis in breast cancer, but these lists are unstable and have only a few genes in common. Instability of feature selection impedes biological interpretability: genes that are relevant for cancer pathology should be members of any predictive gene list obtained for the same clinical type of patients. Stability and interpretability of selected features can be improved by including information on molecular networks in ML methods. The Graph Convolutional Neural Network (GCNN) is a contemporary deep learning approach applicable to gene expression data structured by a prior-knowledge molecular network. Layer-wise Relevance Propagation (LRP) and SHapley Additive exPlanations (SHAP) are methods to explain individual decisions of deep learning models. We used both GCNN+LRP and GCNN+SHAP to construct feature sets by aggregating individual explanations. We suggest a methodology to systematically and quantitatively analyze the stability, the impact on classification performance, and the interpretability of the selected feature sets. We used this methodology to compare GCNN+LRP to GCNN+SHAP and to more classical ML-based feature selection approaches. Utilizing a large breast cancer gene expression dataset, we show that, while feature selection with SHAP is useful in applications where selected features have to be impactful for classification performance, among all studied methods GCNN+LRP delivers the most stable (reproducible) and interpretable gene lists.
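
The abstract does not specify how individual explanations are aggregated or how stability is quantified, so the following is only a minimal sketch under stated assumptions: per-patient relevance scores (as LRP or SHAP might produce) are aggregated by their mean absolute value, the top-k genes form the feature set, and stability is taken as the mean pairwise Jaccard index across repeated runs. The function names, the aggregation rule, the Jaccard measure, and the synthetic data are all hypothetical illustrations, not the authors' method.

```python
from itertools import combinations

import numpy as np


def top_k_features(relevances, k):
    """Aggregate per-sample relevance scores into one feature set.

    relevances: array of shape (n_samples, n_features).
    Assumption: aggregation is the mean absolute relevance per feature.
    """
    scores = np.abs(np.asarray(relevances, dtype=float)).mean(axis=0)
    # Stable sort on negated scores: descending order, ties keep index order.
    order = np.argsort(-scores, kind="stable")
    return set(order[:k].tolist())


def jaccard(a, b):
    """Jaccard index of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(feature_sets):
    """Assumed stability measure: average Jaccard index over all pairs of runs."""
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Synthetic illustration: 4 resampling runs, 40 patients, 50 genes,
# with the first 5 genes carrying a stronger relevance signal.
rng = np.random.default_rng(0)
n_features, k = 50, 5
signal = np.zeros(n_features)
signal[:5] = 3.0
runs = [
    top_k_features(signal + rng.normal(size=(40, n_features)), k)
    for _ in range(4)
]
stability = mean_pairwise_jaccard(runs)
```

A value of `stability` near 1 would indicate that repeated runs select almost the same genes, while a value near 0 would correspond to the instability the abstract describes for earlier prognostic gene lists.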
