Richter Michael, Admasu Alem
Department of Chemistry, Binghamton University, Binghamton, NY 13902, USA.
Department of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, USA.
Int J Mol Sci. 2025 Jul 16;26(14):6795. doi: 10.3390/ijms26146795.
Chemical modifications are standard in therapeutic small interfering RNAs (siRNAs), but predicting their off-target effects remains a significant challenge. Current approaches often rely on sequence-based encodings, which fail to fully capture the structural and protein-RNA interaction details critical for off-target prediction. In this study, we developed a framework to generate reproducible structure-based chemical features, incorporating both molecular fingerprints and computationally derived siRNA-hAgo2 complex structures. Using data from an RNA-Seq off-target study, we generated over 30,000 siRNA-gene data points and systematically compared nine distinct feature representation strategies. Among the resulting datasets, Dataset 3, which used extended connectivity fingerprints (ECFPs) to encode siRNA and mRNA features, achieved the highest predictive performance. An energy-minimized dataset (7R), representing siRNA-hAgo2 structural alignments, was the second-best performer, underscoring the value of incorporating reproducible structural information into feature engineering. Our findings demonstrate that combining detailed structural representations with sequence-based features enables the generation of robust, reproducible chemical features for machine learning models. This offers a promising path forward for off-target prediction and siRNA therapeutic design, and the approach can be extended to include any modification, such as the clinically relevant 2'-F or 2'-OMe.
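To illustrate why a fingerprint-based encoding extends naturally to modified nucleotides, the following is a minimal sketch of an ECFP-style circular fingerprint. It is not the authors' pipeline and not the implementation found in any cheminformatics library: the function names (`circular_fingerprint`, `tanimoto`), the atom invariants (element symbol and degree only), the hashing scheme, and the toy molecules are all assumptions chosen for illustration. The point it shows is that the encoding depends only on the atom-and-bond graph, so a chemical modification is simply a different graph rather than a new sequence symbol.

```python
import hashlib


def _stable_hash(value: str) -> int:
    """Deterministic 32-bit hash (the built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


def circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Toy ECFP-style fingerprint (illustrative assumption, not a library API).

    atoms: list of element symbols, one per heavy atom.
    bonds: list of (i, j) index pairs; bond orders are ignored in this sketch.
    Returns a list of 0/1 values of length n_bits.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Iteration 0: each atom's identifier comes from its element and degree.
    ids = [_stable_hash(f"{sym}|{len(neighbors[i])}") for i, sym in enumerate(atoms)]
    features = set(ids)

    # Each further iteration folds in the sorted identifiers of the neighbors,
    # so an identifier describes a progressively larger circular environment.
    for _ in range(radius):
        ids = [
            _stable_hash(f"{ids[i]}|{sorted(ids[j] for j in neighbors[i])}")
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold the collected identifiers into a fixed-length bit vector.
    bits = [0] * n_bits
    for feature in features:
        bits[feature % n_bits] = 1
    return bits


def tanimoto(a, b):
    """Tanimoto similarity between two equal-length bit vectors."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0


# Toy heavy-atom graphs (hypothetical inputs): ethanol written in two atom
# orders, and dimethyl ether, which has the same atoms but different bonding.
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_reordered = circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
dimethyl_ether = circular_fingerprint(["C", "O", "C"], [(0, 1), (1, 2)])
```

Because the identifiers are built from each atom's local environment rather than its position in the input, the two atom orderings of ethanol give the same bit vector, while dimethyl ether gives a different one. In practice a cheminformatics toolkit would supply richer atom invariants (charge, ring membership, bond order) than this sketch does.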