PRCFX-DT: a new graph-based approach for feature selection and classification of genomic sequences.

Authors

Khodaei Amin, Eskandari Sania, Sharifi Hadi, Mozaffari-Tazehkand Behzad

Affiliations

Faculty of Electrical & Computer Engineering, University of Tabriz, Tabriz, Iran.

Department of Electrical and Computer Engineering, University of Kentucky, Lexington, KY, USA.

Publication information

BMC Bioinformatics. 2025 Jun 17;26(1):159. doi: 10.1186/s12859-025-06183-4.

DOI:10.1186/s12859-025-06183-4

PMID:40528202

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12172359/

Abstract

BACKGROUND

In recent years, viral diseases have caused a substantial number of infections and deaths. Analyzing viral genomic sequences can help assess the current state of viruses and how they may develop in the future. Given the importance of the cell's internal structure and the nucleotide sequences within it, analyzing those sequences can yield a range of informative features. Graph algorithms and machine learning have also been shown to produce useful results when applied to virus samples and even to individual viral variants.

RESULTS

This study proposes a novel approach that uses complex networks and probabilistic graph modeling to extract features from viral genomic sequences. The approach relies on the PageRank centrality algorithm and operates on the codons associated with the nucleotide sequences. Machine learning experiments were conducted on multiple virus datasets and on various variants of coronavirus and influenza viruses. A decision tree classifier applied to the extracted distinguishing features differentiated coronavirus samples from other samples. The high discriminative capability of the graph node centrality feature played a significant role in these experiments and also established a meaningful connection with genetic concepts. Applied to 173,228 genomic sequence samples from 30 distinct virus types, the decision tree classifier achieved an accuracy of 99.73%.
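As a rough illustration of the idea of PageRank centrality over codons, the sketch below builds a weighted codon-transition graph from one sequence and returns a 64-dimensional vector of PageRank scores. The graph construction is an assumption, not taken from the paper: codons are read without overlap in reading frame 0, an edge links each codon to the one that follows it, and edge weights are transition counts. The function names `codon_transitions`, `pagerank`, and `codon_pagerank_features` are hypothetical, and the damping factor of 0.85 is the conventional default rather than a value reported by the authors.

```python
from itertools import product

# All 64 codons in a fixed order, so every sequence maps to the same feature layout.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
VALID = set(CODONS)


def codon_transitions(seq):
    """Count transitions between consecutive non-overlapping codons.

    Assumption: reading frame 0; pairs containing an ambiguous base are skipped.
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = {}
    for a, b in zip(codons, codons[1:]):
        if a in VALID and b in VALID:
            row = counts.setdefault(a, {})
            row[b] = row.get(b, 0.0) + 1.0
    return counts


def pagerank(counts, damping=0.85, tol=1e-10, max_iter=200):
    """Weighted PageRank by power iteration over the 64 codon nodes."""
    n = len(CODONS)
    rank = {c: 1.0 / n for c in CODONS}
    out_weight = {c: sum(counts.get(c, {}).values()) for c in CODONS}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[c] for c in CODONS if out_weight[c] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new = {c: base for c in CODONS}
        for a, row in counts.items():
            for b, w in row.items():
                new[b] += damping * rank[a] * w / out_weight[a]
        delta = sum(abs(new[c] - rank[c]) for c in CODONS)
        rank = new
        if delta < tol:
            break
    return rank


def codon_pagerank_features(seq):
    """Return the PageRank score of each codon, ordered as in CODONS."""
    rank = pagerank(codon_transitions(seq))
    return [rank[c] for c in CODONS]


if __name__ == "__main__":
    features = codon_pagerank_features("ATGGCCATGGCCATGTAA")
    top = sorted(zip(CODONS, features), key=lambda t: -t[1])[:3]
    print(top)
```

Vectors like these, one per genome, could then be passed to any decision tree implementation (for example scikit-learn's `DecisionTreeClassifier`) together with virus-type labels. Which centrality scores the authors actually feed to the classifier is not stated in the abstract.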

CONCLUSION

The proposed algorithm was successfully tested on several types of viruses, and the interpretability of the extracted features also made structural analysis possible. Applying a graph-based approach to genetic features that carry information about the internal structure of nucleotides yielded results that could be significant for identifying any type of virus or a specific viral variant.
