Predicting Parkinson disease related genes based on PyFeat and gradient boosted decision tree.

Affiliation

Information Technology Department, Faculty of Computers and Information, Mansoura University, Mansoura, 35516, Egypt.

Publication information

Sci Rep. 2022 Jun 15;12(1):10004. doi: 10.1038/s41598-022-14127-8.

Abstract

Identifying genes related to Parkinson's disease (PD) is an active research topic in biomedical analysis and plays a critical role in diagnosis and treatment. Many recent studies have proposed techniques for predicting disease-related genes. However, few of these techniques are designed or developed specifically for PD gene prediction, and most of those identify only protein genes and discard long noncoding RNA (lncRNA) genes, which play an essential role in biological processes and in the transformation and development of diseases. This paper proposes a novel prediction system that identifies both protein and lncRNA genes related to PD and can aid early diagnosis. First, we preprocessed the genes into DNA FASTA sequences obtained from the University of California Santa Cruz (UCSC) genome browser and removed redundancies. Second, we extracted significant features from the DNA FASTA sequences using the PyFeat method, with AdaBoost for feature selection. The selected features achieved promising results compared with features extracted by some state-of-the-art feature extraction techniques. Finally, the features were fed to a gradient-boosted decision tree (GBDT) to classify the tested cases. Seven performance metrics were used to evaluate the proposed system. It achieved an average accuracy of 78.6%, an area under the curve (AUC) of 84.5%, an area under the precision-recall curve (AUPR) of 85.3%, an F1-score of 78.3%, a Matthews correlation coefficient (MCC) of 0.575, a sensitivity (SEN) of 77.1%, and a specificity (SPC) of 80.2%. The experiments demonstrate promising results compared with other systems. The top-ranked predicted protein and lncRNA genes were verified against the literature.
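The abstract reports its metrics without defining them. As a minimal sketch, assuming the standard confusion-matrix definitions (the paper's own evaluation code is not shown here), the five threshold-based metrics (accuracy, SEN, SPC, F1-score, MCC) could be computed as below. The function name `threshold_metrics` and the example counts are hypothetical and are not taken from the paper. AUC and AUPR are omitted because they depend on ranked prediction scores rather than on a single confusion matrix.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def threshold_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from binary confusion-matrix counts.

    tp/fn: PD-related genes predicted as related / unrelated.
    tn/fp: unrelated genes predicted as unrelated / related.
    """
    sensitivity = _ratio(tp, tp + fn)  # true positive rate (recall)
    specificity = _ratio(tn, tn + fp)  # true negative rate
    precision = _ratio(tp, tp + fp)
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    # Matthews correlation coefficient, in the range [-1, 1]
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, mcc_denominator)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts chosen only for illustration (not data from the paper).
metrics = threshold_metrics(tp=77, fp=20, tn=80, fn=23)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC is reported as a raw coefficient rather than a percentage because it ranges from -1 to 1, which is why the abstract gives it as 0.575 while the other metrics are percentages.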

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3704/9200794/55711b6ec6e7/41598_2022_14127_Fig1_HTML.jpg
