Remori Veronica, Prest Michela, Fasano Mauro
Department of Science and High Technology, University of Insubria, Como, Italy.
Center of Neuroscience Research, University of Insubria, Busto Arsizio, Italy.
Front Bioinform. 2025 Aug 12;5:1657841. doi: 10.3389/fbinf.2025.1657841. eCollection 2025.
i-Motifs (iMs) are cytosine-rich, four-stranded DNA structures with emerging roles in gene regulation and genome stability. Despite their biological relevance, genome-wide prediction of iM-forming sequences remains limited by low specificity and high false-positive rates, leading to considerable experimental burden.
To address this, we developed a refined computational approach that prioritizes high-confidence iM candidates using a Position-Specific Similarity Matrix (PSSM) derived from multiple sequence alignments. The human reference genome (hg38) was scanned using a custom regular expression targeting cytosine-rich motifs, followed by scoring each sequence with the PSSM. Statistical significance was assessed via permutation testing, one-sided t-tests, Benjamini-Hochberg correction, and Z-scores.
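The scan-then-score workflow described above can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the authors' implementation: the regular expression (four C-tracts of at least three cytosines separated by 1-12 nt loops), the log-odds PSSM built with pseudocounts from a toy ungapped alignment, the sliding-window scoring, and the helper names `scan`, `build_pssm`, `score`, `permutation_test`, and `benjamini_hochberg`. The one-sided t-test step is omitted.

```python
import math
import random
import re
from statistics import mean, pstdev

# Hypothetical pattern (not the published one): four C-tracts of >= 3 cytosines
# separated by loops of 1-12 nucleotides.
IM_PATTERN = re.compile(r"C{3,}(?:[ACGT]{1,12}?C{3,}){3}")


def scan(sequence):
    """Yield (start, motif) for each non-overlapping match of IM_PATTERN."""
    for match in IM_PATTERN.finditer(sequence.upper()):
        yield match.start(), match.group(0)


def build_pssm(aligned, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix from equal-length, ungapped aligned sequences."""
    pssm = []
    for i in range(len(aligned[0])):
        column = [s[i] for s in aligned]
        total = len(column) + 4 * pseudocount
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm


def score(seq, pssm):
    """Best sliding-window score; seq must be at least as long as the PSSM."""
    width = len(pssm)
    return max(
        sum(pssm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )


def permutation_test(seq, pssm, n=200, rng=None):
    """Compare the observed score with scores of shuffled copies of seq.

    Returns (observed score, Z-score against the null, empirical one-sided p).
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    observed = score(seq, pssm)
    letters = list(seq)
    null = []
    for _ in range(n):
        rng.shuffle(letters)
        null.append(score("".join(letters), pssm))
    sd = pstdev(null)
    z = (observed - mean(null)) / sd if sd > 0 else 0.0
    p = (1 + sum(s >= observed for s in null)) / (n + 1)
    return observed, z, p


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch, each motif returned by `scan` would be passed to `permutation_test`, and the resulting p-values collected across all motifs before calling `benjamini_hochberg` once. Shuffling preserves nucleotide composition, so the null distribution tests positional arrangement rather than cytosine content alone.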
This pipeline identified 37,075 candidate sequences (15-46 nucleotides) with strong iM-forming potential. Validation against experimentally confirmed iMs and known G-quadruplexes (G4s) demonstrated significant differences in alignment scores and sequence similarity, confirming structural specificity. A random forest classifier trained on nucleotide features further supported the distinctiveness of the candidates, achieving high classification performance.
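The abstract does not specify which nucleotide features were given to the classifier. As a hedged illustration only, the sketch below computes one plausible feature set per sequence: length, mononucleotide and dinucleotide frequencies, and the longest cytosine run. The function name `nucleotide_features` and the choice of features are assumptions, not the authors' feature definition.

```python
import re
from itertools import product


def nucleotide_features(seq):
    """Return a dict of simple composition features for one DNA sequence.

    Hypothetical feature set for illustration; the published features may differ.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")

    features = {"length": n}

    # Mononucleotide frequencies.
    for base in "ACGT":
        features[f"freq_{base}"] = seq.count(base) / n

    # Dinucleotide frequencies over overlapping pairs.
    pairs = [seq[i:i + 2] for i in range(n - 1)]
    for a, b in product("ACGT", repeat=2):
        features[f"freq_{a}{b}"] = pairs.count(a + b) / max(n - 1, 1)

    # Longest uninterrupted run of cytosines.
    features["longest_C_run"] = max(
        (len(run) for run in re.findall(r"C+", seq)), default=0
    )
    return features
```

A table of such dictionaries, one row per candidate, iM, or G4 sequence, is the kind of input a random forest implementation would accept after conversion to a numeric matrix.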
This work presents a scalable and statistically robust method to enrich for biologically relevant iM sequences, providing a valuable resource for future experimental validation and the rational design of ligands targeting iMs to modulate gene expression in contexts such as cancer.