DHFS-ECM: Design of a Dual Heuristic Feature Selection-based Ensemble Classification Model for the Identification of Bamboo Species from Genomic Sequences.

Author information

Durge Aditi R, Shrimankar Deepti D

Affiliation

Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India.

Publication information

Curr Genomics. 2024 May 31;25(3):185-201. doi: 10.2174/0113892029268176240125055419. Epub 2024 Feb 1.

Abstract

BACKGROUND

Analyzing genomic sequences plays a crucial role in understanding biological diversity and classifying bamboo species. Existing methods for genomic sequence analysis suffer from limitations such as complexity, low accuracy, and the need for constant reconfiguration in response to evolving genomic datasets.

AIM

This study addresses these limitations by introducing a novel Dual Heuristic Feature Selection-based Ensemble Classification Model (DHFS-ECM) for the precise identification of bamboo species from genomic sequences.

METHODS

The proposed DHFS-ECM method employs a Genetic Algorithm to perform dual heuristic feature selection. This process maximizes inter-class variance, leading to the selection of informative N-gram feature sets. Subsequently, intra-class variance levels are used to create optimal training and validation sets, ensuring comprehensive coverage of class-specific features. The selected features are then processed through an ensemble classification layer, combining multiple stratification models for species-specific categorization.
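The first selection stage described above can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows one plausible way a genetic algorithm could pick N-gram (k-mer) features by maximizing inter-class variance. The function names, the per-feature penalty, the population settings, and the toy sequences and species labels are all assumptions made for illustration; the intra-class variance stage and the ensemble layer are not covered.

```python
import random
from itertools import product


def kmer_vocab(k=2):
    """All k-mers (N-grams) over the DNA alphabet, in a fixed order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_counts(seq, k=2):
    """Relative frequency of each overlapping k-mer in a sequence."""
    counts = dict.fromkeys(kmer_vocab(k), 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    total = max(1, len(seq) - k + 1)
    return [counts[m] / total for m in kmer_vocab(k)]


def inter_class_variance(X, y, mask):
    """Sum, over the selected features, of the variance of the per-class means."""
    classes = sorted(set(y))
    score = 0.0
    for j, keep in enumerate(mask):
        if not keep:
            continue
        means = []
        for c in classes:
            vals = [row[j] for row, label in zip(X, y) if label == c]
            means.append(sum(vals) / len(vals))
        mu = sum(means) / len(means)
        score += sum((m - mu) ** 2 for m in means) / len(means)
    return score


def fitness(X, y, mask, penalty=1e-4):
    """Inter-class variance minus an assumed per-feature penalty."""
    return inter_class_variance(X, y, mask) - penalty * sum(mask)


def select_features(X, y, generations=30, pop_size=20,
                    mutation_rate=0.05, seed=0):
    """Simple elitist GA over boolean feature masks (illustrative settings)."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(X, y, m), reverse=True)
        elite = pop[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]  # bit-flip mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda m: fitness(X, y, m))


# Toy, made-up sequences and labels; not real bamboo data.
seqs = ["ATATATATAT", "TATATATATA", "GCGCGCGCGC", "CGCGCGCGCG"]
labels = ["species_a", "species_a", "species_b", "species_b"]
X = [kmer_counts(s) for s in seqs]
best_mask = select_features(X, labels)
selected = [m for m, keep in zip(kmer_vocab(), best_mask) if keep]
print("selected k-mers:", selected)
```

In this sketch the penalty term simply discourages the algorithm from keeping features whose class means barely differ; a real pipeline would tune it, or fix the number of features instead, and would then pass the selected columns on to the classifiers.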

RESULTS

Comparative analysis with state-of-the-art methods demonstrates that DHFS-ECM achieves remarkable improvements in accuracy (9.5%), precision (5.9%), recall (8.5%), and AUC performance (4.5%). Importantly, the model maintains its performance even with an increased number of species classes due to the continuous learning facilitated by the Dual Heuristic Genetic Algorithm Model.

CONCLUSION

DHFS-ECM offers several key advantages, including efficient feature extraction, reduced model complexity, enhanced interpretability, and increased robustness and accuracy through the ensemble classification layer. These attributes make DHFS-ECM a promising tool for real-time clinical applications and a valuable contribution to the field of genomic sequence analysis.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cc99/11288165/fc3d1066e88e/CG-25-185_F1.jpg
