AlzGenPred - CatBoost-based gene classifier for predicting Alzheimer's disease using high-throughput sequencing data.

Author information

Shukla Rohit, Singh Tiratha Raj

Affiliations

Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology (JUIT), Waknaghat, Solan, 173234, H.P., India.

Center of Excellence for Aging and Brain Repair, Morsani College of Medicine, University of South Florida, Tampa, 33613, FL, USA.

Publication information

Sci Rep. 2024 Dec 5;14(1):30294. doi: 10.1038/s41598-024-82208-x.

Abstract

Alzheimer's disease (AD) is a progressive neurodegenerative disorder characterized by memory loss. Advances in next-generation sequencing have made an enormous amount of AD-associated genomics data available. However, how these genes are involved in AD remains an open research question. AlzGenPred was therefore developed to identify AD-associated genes using machine learning. A total of 13,504 features derived from eight sequence-encoding schemes were generated and evaluated using 16 machine learning algorithms. Network-based features significantly outperformed sequence-based features and effectively distinguished AD-associated genes, whereas sequence-based features failed to classify them accurately. To improve performance, we generated 24 fused features (6,020 dimensions) from the sequence-based encodings, increasing accuracy by 5-7% with a two-step LightGBM-based recursive feature selection method. However, accuracy remained below 70% even after hyperparameter tuning. Network-based features were therefore used to build AlzGenPred, a CatBoost-based ML method with 96.55% accuracy and 98.99% AUROC. The method was tested on the AlzGene dataset, where it showed 96.43% accuracy, and the model was then validated using a transcriptomics dataset. AlzGenPred provides a reliable and user-friendly tool for identifying potential AD biomarkers, accelerating biomarker discovery, and advancing our understanding of AD. It is available at https://www.bioinfoindia.org/alzgenpred/ and https://github.com/shuklarohit815/AlzGenPred .
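
The abstract reports two evaluation metrics, accuracy and AUROC. As a reading aid, the sketch below shows how these two quantities are commonly computed for a binary gene classifier. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the numbers are unrelated to the results above.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predicted labels that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a
    randomly chosen positive is scored above a randomly chosen negative
    (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = AD-associated gene, 0 = not associated.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # made-up classifier scores

# Assumed decision threshold of 0.5 to turn scores into labels.
preds = [1 if s >= 0.5 else 0 for s in scores]

print(f"accuracy = {accuracy(labels, preds):.4f}")
print(f"AUROC    = {auroc(labels, scores):.4f}")
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is one reason the two are usually reported together.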
