Utilizing biological experimental data and molecular dynamics for the classification of mutational hotspots through machine learning.

Authors

Davies James G, Menzies Georgina E

Affiliation

Molecular Bioscience Division, School of Biosciences, Cardiff University, Cardiff, CF10 3AX, United Kingdom.

Publication

Bioinform Adv. 2024 Aug 26;4(1):vbae125. doi: 10.1093/bioadv/vbae125. eCollection 2024.

Abstract

MOTIVATION

Benzo[a]pyrene, a notorious DNA-damaging carcinogen, belongs to the family of polycyclic aromatic hydrocarbons commonly found in tobacco smoke. Surprisingly, nucleotide excision repair (NER) machinery exhibits inefficiency in recognizing specific bulky DNA adducts including Benzo[a]pyrene Diol-Epoxide (BPDE), a Benzo[a]pyrene metabolite. While sequence context is emerging as the leading factor linking the inadequate NER response to BPDE adducts, the precise structural attributes governing these disparities remain inadequately understood. We therefore combined the domains of molecular dynamics and machine learning to conduct a comprehensive assessment of helical distortion caused by BPDE-Guanine adducts in multiple gene contexts. Specifically, we implemented a dual approach involving a random forest classification-based analysis and subsequent feature selection to identify precise topological features that may distinguish adduct sites of variable repair capacity. Our models were trained using helical data extracted from duplexes representing both BPDE hotspot and nonhotspot sites within the gene, then applied to sites within , , and genes.
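The classify-then-rank workflow described above can be illustrated with a minimal sketch. It is not the authors' pipeline (their code is in the repositories listed under Availability and Implementation). It assumes scikit-learn and NumPy are installed, uses synthetic data in place of real helical parameters, and the feature names are hypothetical placeholders. Feature selection is approximated here by ranking the forest's impurity-based importances, which may differ from the method used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical helical descriptors; one row per simulated trajectory frame.
feature_names = ["twist", "roll", "tilt", "rise", "shift", "slide"]
n_frames = 400
X = rng.normal(size=(n_frames, len(feature_names)))
y = rng.integers(0, 2, size=n_frames)  # 1 = hotspot, 0 = nonhotspot (synthetic labels)

# Assumption for illustration only: make "twist" differ between the classes.
X[:, 0] += 2.0 * y

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the random forest classifier on the training frames.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Evaluate with the same metric families reported in the abstract.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Rank features by impurity-based importance, highest first.
ranked = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_feature = ranked[0][0]

print(f"accuracy={accuracy:.2f} precision={precision:.2f} f1={f1:.2f}")
print("most informative feature:", top_feature)
```

On real data, each row would hold parameters extracted from a trajectory (for example, Curves+ output), and the ranking would point to the helical descriptors that best separate the two site classes.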

RESULTS

We show our optimized model consistently achieved exceptional performance, with accuracy, precision, and f1 scores exceeding 91%. Our feature selection approach uncovered that discernible variance in regional base pair rotation played a pivotal role in informing the decisions of our model. Notably, these disparities were highly conserved among and duplexes and appeared to be influenced by the regional GC content. As such, our findings suggest that there are indeed conserved topological features distinguishing hotspot and nonhotspot sites, highlighting regional GC content as a potential biomarker for mutation.
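Regional GC content, highlighted above as a potential biomarker, can be computed as the fraction of G and C bases in a window around a site. The sketch below is one plausible definition, not necessarily the one used in the paper: the function name, the window size (`flank`), and the example sequence are all assumptions made for illustration.

```python
def regional_gc_content(sequence: str, site: int, flank: int = 5) -> float:
    """Return the G/C fraction in a window of `flank` bases on each side of `site`.

    `site` is a 0-based index; the window is clipped at the sequence ends.
    """
    seq = sequence.upper()
    if not 0 <= site < len(seq):
        raise IndexError("site is outside the sequence")
    start = max(0, site - flank)
    end = min(len(seq), site + flank + 1)
    window = seq[start:end]
    return sum(base in "GC" for base in window) / len(window)


# Hypothetical sequence; index 6 stands in for an adducted guanine.
example_sequence = "ATGCGCGGATTA"
gc_fraction = regional_gc_content(example_sequence, site=6, flank=2)
print(f"regional GC content: {gc_fraction:.2f}")
```

Comparing this value between hotspot and nonhotspot sites would be one simple way to test whether the GC trend reported above holds for a given set of sequences.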

AVAILABILITY AND IMPLEMENTATION

Code for comparing machine learning classifiers and evaluating their performance is available at https://github.com/jdavies24/ML-Classifier-Comparison, and code for analysing DNA structure with Curves+ and Canal using Random Forest is available at https://github.com/jdavies24/ML-classification-of-DNA-trajectories.
