Suwayyid Faisal, Hozumi Yuta, Feng Hongsong, Zia Mushal, Wee JunJie, Wei Guo-Wei
Department of Mathematics, King Fahd University of Petroleum and Minerals, Dhahran 31261, KSA.
Department of Mathematics, Michigan State University, MI 48824, USA.
ArXiv. 2025 Aug 13:arXiv:2508.09406v1.
Despite the availability of various sequence analysis models, comparative genomic analysis remains a challenge in genomics, genetics, and phylogenetics. Commutative algebra, a fundamental tool in algebraic geometry and number theory, has rarely been used in data and biological sciences. In this study, we introduce commutative algebra k-mer learning (CAKL) as the first nonlinear algebraic framework for analyzing genomic sequences. CAKL bridges commutative algebra, algebraic topology, combinatorics, and machine learning to establish a new mathematical paradigm for comparative genomic analysis. We evaluate its effectiveness on three tasks (genetic variant identification, phylogenetic tree analysis, and viral genome classification) that typically require alignment-based, alignment-free, and machine-learning approaches, respectively. Across eleven datasets, CAKL outperforms five state-of-the-art sequence analysis methods, particularly in viral classification, and maintains stable predictive accuracy as dataset size increases, underscoring its scalability and robustness. This work ushers in a new era in commutative algebraic data analysis and learning.