Davies James G, Menzies Georgina E
Molecular Bioscience Division, School of Biosciences, Cardiff University, Cardiff, CF10 3AX, United Kingdom.
Bioinform Adv. 2024 Aug 26;4(1):vbae125. doi: 10.1093/bioadv/vbae125. eCollection 2024.
Benzo[a]pyrene, a notorious DNA-damaging carcinogen, belongs to the family of polycyclic aromatic hydrocarbons commonly found in tobacco smoke. Surprisingly, the nucleotide excision repair (NER) machinery is inefficient at recognizing specific bulky DNA adducts, including Benzo[a]pyrene Diol-Epoxide (BPDE), a Benzo[a]pyrene metabolite. While sequence context is emerging as the leading factor linking the inadequate NER response to BPDE adducts, the precise structural attributes governing these disparities remain poorly understood. We therefore combined molecular dynamics and machine learning to conduct a comprehensive assessment of the helical distortion caused by BPDE-Guanine adducts in multiple gene contexts. Specifically, we implemented a dual approach involving random forest classification followed by feature selection to identify precise topological features that may distinguish adduct sites of variable repair capacity. Our models were trained using helical data extracted from duplexes representing both BPDE hotspot and nonhotspot sites within the gene, then applied to sites within , , and genes.
We show that our optimized model consistently achieved exceptional performance, with accuracy, precision, and F1 scores exceeding 91%. Our feature selection approach revealed that discernible variance in regional base pair rotation played a pivotal role in informing the decisions of our model. Notably, these disparities were highly conserved among and duplexes and appeared to be influenced by the regional GC content. As such, our findings suggest that there are indeed conserved topological features distinguishing hotspot and nonhotspot sites, highlighting regional GC content as a potential biomarker for mutation.
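As a rough illustration of two quantities named above, the sketch below computes the GC fraction in a window around a site and derives accuracy, precision, and F1 from predicted labels. It is not taken from the authors' repositories: the function names, the default window of 5 bases on either side, the example sequence, and the label encoding (1 = hotspot, 0 = nonhotspot) are all assumptions made for illustration.

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def regional_gc_content(seq: str, site: int, flank: int = 5) -> float:
    """GC fraction in a window of `flank` bases on each side of `site`.

    `site` is a 0-based index; the window is clipped at the sequence ends.
    The window size is a hypothetical choice, not a value from the paper.
    """
    if not 0 <= site < len(seq):
        raise ValueError("site is outside the sequence")
    start = max(0, site - flank)
    end = min(len(seq), site + flank + 1)
    return gc_content(seq[start:end])


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary labelling.

    `positive` marks the class treated as the positive one
    (assumed here to be 1 = hotspot).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up sequence and labels, for illustration only.
    example = "ATGCGCGTTA"
    print(regional_gc_content(example, site=4, flank=2))
    print(classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Clipping the window at the sequence ends keeps sites near a duplex terminus usable, although their windows are then shorter than those of interior sites, so their GC fractions are not strictly comparable.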
Code for comparing machine learning classifiers and evaluating their performance is available at https://github.com/jdavies24/ML-Classifier-Comparison, and code for analysing DNA structure with Curves+ and Canal using Random Forest is available at https://github.com/jdavies24/ML-classification-of-DNA-trajectories.