Fu Chengbo, Niskanen Einari A, Wei Gong-Hong, Yang Zhirong, Sanvicente-García Marta, Güell Marc, Cheng Lu
Department of Computer Science, School of Science, Aalto University, 02150 Espoo, Finland.
Institute of Biomedicine, University of Eastern Finland, 70211 Kuopio, Finland.
Genome Res. 2025 May 2;35(5):1234-1246. doi: 10.1101/gr.279458.124.
Identifying and illustrating patterns in DNA sequences are crucial tasks in various biological data analyses. In this task, patterns are often represented by sets of k-mers, the fundamental building blocks of DNA sequences. To visually unveil these patterns, one could project each k-mer onto a point in two-dimensional (2D) space. However, this projection poses challenges owing to the high-dimensional nature of k-mers and their unique mathematical properties. Here, we establish a mathematical system to address the peculiarities of the k-mer manifold. Leveraging this k-mer manifold theory, we develop a statistical method named KMAP for detecting k-mer patterns and visualizing them in 2D space. We applied KMAP to three distinct data sets to showcase its utility. KMAP achieves a comparable performance to the classical method MEME, with ∼90% similarity in motif discovery from HT-SELEX data. In the analysis of H3K27ac ChIP-seq data from Ewing sarcoma (EWS), we find that BACH1, OTX2, and KNCH2 might affect EWS prognosis by binding to promoter and enhancer regions across the genome. We also observe potential colocalization of BACH1, OTX2, and the motif CCCAGGCTGGAGTGC in ∼70 bp windows in the enhancer regions. Furthermore, we find that FLI1 binds to the enhancer regions after ETV6 degradation, indicating competitive binding between ETV6 and FLI1. Moreover, KMAP identifies four prevalent patterns in gene editing data of the AAVS1 locus, aligning with findings reported in the literature. These applications underscore that KMAP can be a valuable tool across various biological contexts.