Jaiswal Saish, Murthy Hema A, Narayanan Manikandan
Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai 600036, India.
Department of Computer Science and Engineering, Shiv Nadar University, Chennai 603110, India.
Bioinform Adv. 2024 Nov 5;4(1):vbae171. doi: 10.1093/bioadv/vbae171. eCollection 2024.
Genomic signal processing (GSP), which transforms biomolecular sequences into discrete signals for spectral analysis, has provided valuable insights into DNA sequence, structure, and evolution. However, two challenges persist: representing variable-length sequences spectrally for tasks such as species classification, and interpreting the resulting spectra to identify discriminative DNA regions.
We introduce SpecGMM, a novel framework that integrates sliding window-based Spectral analysis with a Gaussian Mixture Model to transform variable-length DNA sequences into fixed-dimensional spectral representations for taxonomic classification. SpecGMM's hyperparameters were selected on a dataset of plant sequences and applied unchanged across diverse datasets, including mitochondrial DNA, viral and bacterial genomes, and 16S rRNA sequences. Across these datasets, SpecGMM outperformed a baseline method, with a 9.45% average and 35.55% maximum improvement in test accuracy for a Linear Discriminant classifier. Regarding interpretability, SpecGMM revealed discriminative hypervariable regions in 16S rRNA sequences (particularly V3/V4 for discriminating higher taxa and V2/V3 for lower taxa), corroborating their known relevance to classification. SpecGMM's spectrogram video analysis helped visualize species-specific DNA signatures. SpecGMM thus provides a robust and interpretable method for spectral DNA analysis, opening new avenues in GSP research.
SpecGMM's source code is available at https://github.com/BIRDSgroup/SpecGMM.
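The general idea described above (sliding-window spectra pooled through a Gaussian mixture into a fixed-length vector) can be illustrated with a minimal sketch. It is not the SpecGMM implementation, which is available in the repository above. Everything specific here is an assumption made for illustration: the nucleotide-to-number mapping, the window and hop sizes, the hard-coded toy mixture parameters (in practice a mixture would be fitted to training data, for example by expectation-maximization), and the choice to pool by averaging per-window component posteriors.

```python
import cmath
import math

# Hypothetical numeric mapping; the abstract does not say which one SpecGMM uses.
NUCLEOTIDE_VALUES = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def encode(seq):
    """Map a DNA string to a discrete signal (unknown symbols become 0.0)."""
    return [NUCLEOTIDE_VALUES.get(base, 0.0) for base in seq.upper()]


def magnitude_spectrum(frame):
    """Magnitudes of the first n//2 + 1 DFT bins of a real-valued frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def windowed_spectra(seq, window=8, hop=4):
    """One magnitude spectrum per sliding window; the count depends on len(seq)."""
    signal = encode(seq)
    return [
        magnitude_spectrum(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, hop)
    ]


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component given one spectrum."""
    logs = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def fixed_length_vector(seq, weights, means, variances, window=8, hop=4):
    """Average the per-window posteriors: one value per mixture component."""
    frames = windowed_spectra(seq, window, hop)
    if not frames:
        raise ValueError("sequence is shorter than the window")
    pooled = [0.0] * len(weights)
    for frame in frames:
        for i, r in enumerate(responsibilities(frame, weights, means, variances)):
            pooled[i] += r
    return [p / len(frames) for p in pooled]


# Toy two-component mixture over 5 spectral bins (window=8 gives 8//2 + 1 bins).
toy_weights = [0.5, 0.5]
toy_means = [[0.0] * 5, [5.0] * 5]
toy_variances = [[4.0] * 5, [4.0] * 5]

short_vec = fixed_length_vector("ACGTACGTACGTAC", toy_weights, toy_means, toy_variances)
long_vec = fixed_length_vector("ACGTTGCAACGTTGCAACGTTGCAGG", toy_weights, toy_means, toy_variances)
```

Both `short_vec` and `long_vec` have one entry per mixture component, regardless of the input sequence length, so they can be passed to any classifier that expects fixed-size inputs.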