Papastathopoulos-Katsaros Athanasios, Liu Zhandong
Department of Pediatrics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX, 77030, United States of America.
Data Science Center, Jan and Dan Duncan Neurological Research Institute, 1250 Moursund Street, Houston, TX, 77030, United States of America.
bioRxiv. 2025 Jun 7:2025.06.03.657533. doi: 10.1101/2025.06.03.657533.
Alignment-based methods are fundamental for sequence comparison but are often computationally prohibitive for large-scale genomic analyses. This limitation has spurred the development of quicker, alignment-free alternatives, such as k-mer analysis, which are crucial for studying long non-coding ribonucleic acids (lncRNAs) in plants. These lncRNAs play critical roles in regulating gene expression at both the epigenetic and transcriptomic levels. However, existing alignment-free approaches typically lose positional information, which can be vital for achieving accurate classification.
We propose positional frequency chaos game representation (PFCGR), a novel encoding that improves on the traditional frequency chaos game representation (FCGR) by incorporating four statistical moments of k-mer positions: mean, standard deviation, skewness, and kurtosis. This creates a multi-channel image representation of genomic sequences, enabling machine learning models such as Logistic Regression, Random Forests, and Convolutional Neural Networks to classify plant lncRNAs directly from raw genomic sequences. Tested on seven major crop species, our PFCGR-based classifiers achieve classification accuracies comparable to or exceeding those of the computationally intensive DNABERT-based model (Danilevicz et al. (1)), while requiring significantly fewer computational resources. These results demonstrate PFCGR's potential as an efficient and accurate tool for plant lncRNA identification, as well as its ability to facilitate large-scale computational studies in genomics.
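To make the encoding concrete, the following is a minimal sketch of how a PFCGR-style image could be computed. It is not the authors' implementation. The corner assignment in `CORNERS`, the five-channel layout (k-mer count plus the four positional moments), the normalization of positions to [0, 1], and the function names `kmer_to_pixel` and `pfcgr` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical corner assignment for the chaos game square;
# the paper's convention may differ.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_pixel(kmer):
    """Map a k-mer to a (row, col) cell of a 2^k x 2^k grid, CGR style."""
    row = col = 0
    # Iterate in reverse so the last base selects the coarsest quadrant.
    for base in reversed(kmer):
        r, c = CORNERS[base]
        row = (row << 1) | r
        col = (col << 1) | c
    return row, col


def pfcgr(seq, k=3):
    """Return a (5, 2^k, 2^k) array: count, mean, std, skewness, kurtosis.

    Positions are normalized to [0, 1] (an assumption). Skewness and
    kurtosis are left at 0 when a k-mer's positions have zero spread.
    """
    seq = seq.upper().replace("U", "T")
    size = 2 ** k
    n_kmers = len(seq) - k + 1
    positions = {}
    for i in range(n_kmers):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers with ambiguous bases such as N
        pixel = kmer_to_pixel(kmer)
        positions.setdefault(pixel, []).append(i / max(n_kmers - 1, 1))

    image = np.zeros((5, size, size))
    for (row, col), pos in positions.items():
        p = np.asarray(pos)
        mean, std = p.mean(), p.std()
        image[0, row, col] = len(p)   # plain FCGR channel
        image[1, row, col] = mean
        image[2, row, col] = std
        if std > 0:
            z = (p - mean) / std
            image[3, row, col] = (z ** 3).mean()  # skewness
            image[4, row, col] = (z ** 4).mean()  # kurtosis (non-excess)
    return image


if __name__ == "__main__":
    img = pfcgr("ACGTACGTTTGACA", k=2)
    print(img.shape)
```

The resulting array can be flattened for Logistic Regression or Random Forests, or passed as a multi-channel image to a Convolutional Neural Network. Channel 0 alone corresponds to an ordinary FCGR, so the positional channels are the only addition in this sketch.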