Gene targeting in amyotrophic lateral sclerosis using causality-based feature selection and machine learning.

Author information

Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Northwell Health, Hempstead, NY, 11549, USA.

Institute of Molecular Medicine, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, 11030, USA.

Publication information

Mol Med. 2023 Jan 24;29(1):12. doi: 10.1186/s10020-023-00603-y.

Abstract

BACKGROUND

Amyotrophic lateral sclerosis (ALS) is a rare, progressive neurodegenerative disease that affects upper and lower motor neurons. Because the molecular basis of the disease remains elusive, high-throughput sequencing technologies, combined with data mining and machine learning methods, could help identify pathogenetic mechanisms. High dimensionality is a major problem when applying machine learning to biomedical data analysis, since a very large number of features is available for a limited number of samples. The aim of this study was to develop a methodology for training interpretable machine learning models that classify ALS and ALS-subtype samples from gene expression datasets.

METHODS

We reduced the dimensionality of gene expression data with a semi-automated, systematic gene selection procedure based on Statistically Equivalent Signature (SES), a causality-based feature selection algorithm. Machine learning classifiers were then trained with Boosted Regression Trees (XGBoost) and Random Forest, and SHapley Additive exPlanations (SHAP) values were used to interpret them. The methodology was developed and tested on two distinct, publicly available ALS RNA-seq datasets. We compared the performance of SES as a dimensionality reduction method against (a) Least Absolute Shrinkage and Selection Operator (LASSO) and (b) Local Outlier Factor (LOF).
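The select-then-classify workflow described above can be illustrated with a minimal sketch. It is not the authors' pipeline: SES is replaced by an L1-penalised linear selector (in the spirit of the LASSO baseline), SHAP values are replaced by the Random Forest's built-in feature importances, and the data are synthetic. All sizes and hyperparameters (120 samples, 2000 features, `C=0.05`, `max_features=50`) are assumptions chosen only so the example runs quickly with scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for an expression matrix: few samples, many "genes",
# three classes (hypothetical data, not the ALS datasets).
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=15, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Feature selection sits inside the pipeline so it is fitted on the
# training split only, which avoids leaking test information.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    # L1-penalised linear model as the selector (assumed stand-in for SES);
    # at most 50 features with non-zero weight are kept.
    ("select", SelectFromModel(
        LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000),
        max_features=50,
    )),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Map the forest's importances back to the original column indices.
selected = pipeline.named_steps["select"].get_support(indices=True)
importances = pipeline.named_steps["clf"].feature_importances_
top = selected[np.argsort(importances)[::-1][:10]]

print(f"features kept: {len(selected)} of {X.shape[1]}")
print(f"held-out accuracy: {accuracy:.4f}")
print("top-ranked feature indices:", top.tolist())
```

With real expression data, the selector and the importance ranking would be swapped for the methods named in the abstract; the structure of the sketch (fit the selector on training samples only, then train and interpret the classifier on the reduced feature set) is the part intended to carry over.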

RESULTS

The proposed methodology achieved 85.18% accuracy in classifying cerebellum or frontal cortex samples as C9orf72-related familial ALS, sporadic ALS, or healthy. Importantly, the genes identified as the most determinative have also been reported as disease-associated in the ALS literature. On the evaluation dataset, the methodology achieved 88.89% accuracy in classifying sporadic ALS motor neuron samples. When LASSO was used as the feature selection method instead of SES, the accuracy of the machine learning classifiers ranged from 74.07% to 96.30%, depending on the tissue assessed, while LOF underperformed significantly (77.78% accuracy for the classification of pooled cerebellum and frontal cortex samples).

CONCLUSIONS

Using SES, we addressed the challenge of high dimensionality in gene expression data analysis and trained accurate machine learning classifiers for ALS that are specific to the gene expression patterns of different disease subtypes and tissue samples, while also identifying disease-associated genes.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ddff/9872307/3738dff74aee/10020_2023_603_Fig1_HTML.jpg
