Feature Selection Strategies for Deep Learning-Based Classification in Ultra-High-Dimensional Genomic Data

Authors

Kotlarz Krzysztof, Słomian Dawid, Zawadzka Weronika, Szyda Joanna

Affiliations

Biostatistics Group, Department of Genetics, Wroclaw University of Environmental and Life Sciences, 51-631 Wroclaw, Poland.

National Research Institute of Animal Production, 32-083 Balice, Poland.

Publication information

Int J Mol Sci. 2025 Aug 18;26(16):7961. doi: 10.3390/ijms26167961.

DOI: 10.3390/ijms26167961

PMID: 40869283

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12386708/

Abstract

The advancement of high-throughput sequencing has revolutionised genomic research by generating large amounts of data. However, whole-genome sequencing brings a statistical challenge known as the p >> n problem, in which the number of features (p) far exceeds the number of samples (n). We classified 1825 individuals into five breeds based on 11,915,233 SNPs. First, three feature selection algorithms were applied: SNP-tagging and two approaches based on supervised rank aggregation (SRA), followed by either one-dimensional (1D-SRA) or multidimensional (MD-SRA) feature clustering. Individuals were then classified into breeds using a deep learning classifier composed of convolutional neural networks. SNPs selected by SNP-tagging yielded the lowest F1-score (86.87%), although this approach was fast to compute. The 1D-SRA was less suitable for ultra-high-dimensional data due to computational, memory, and storage limitations, yet the SNP set selected by this algorithm provided the best classification quality (96.81%). MD-SRA provided a good balance between classification quality (95.12%) and computational efficiency (17x lower analysis time, 14x lower data storage). Unlike SNP-tagging, SRA-based approaches are universal and are not limited to genomic data. This study addressed the demand for efficient computational and statistical tools for feature selection in high-dimensional genomic data. The results demonstrate that the proposed MD-SRA is suitable for the classification of high-dimensional data.
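
The F1-scores quoted in the abstract summarise classification quality across five breeds. As a minimal sketch of how such a score can be computed for a five-class problem, the Python below derives a one-vs-rest F1 for each class from true and predicted labels and then averages them (macro averaging). The abstract does not state which averaging scheme was used, so macro averaging, the function names `per_class_f1` and `macro_f1`, and the toy labels are all assumptions made for illustration; none of this is code or data from the study.

```python
from typing import List, Sequence


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> List[float]:
    """Return the one-vs-rest F1-score of each class label 0..n_classes-1."""
    scores = []
    for c in range(n_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> float:
    """Unweighted mean of the per-class F1-scores (an assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred, n_classes)
    return sum(scores) / n_classes


# Toy labels for five hypothetical breeds (0..4); NOT data from the study.
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 0, 1, 2, 2, 2, 3, 3, 4, 0]

if __name__ == "__main__":
    print("per-class F1:", per_class_f1(y_true, y_pred, 5))
    print("macro F1:", macro_f1(y_true, y_pred, 5))
```

Macro averaging weights every breed equally regardless of how many individuals it contains; a sample-weighted average would be an equally plausible choice if the breeds are unevenly represented.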
