MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, Radcliffe Department of Medicine, University of Oxford, Oxford, OX3 9DS, UK.
Sci Data. 2024 Aug 22;11(1):911. doi: 10.1038/s41597-024-03772-5.
We are witnessing a steep increase in model development initiatives in genomics that employ high-end machine learning methodologies. Of particular interest are models that predict certain genomic characteristics based solely on DNA sequence. These models, however, treat DNA as a mere collection of four letters, A, T, G and C, dismissing past scientific advances that enable the use of more intricate information from nucleic acid sequences. Here, we provide a comprehensive database of quantum mechanical (QM) and geometric features for all possible 7-meric DNA sequences in their representative B, A and Z conformations. The database is generated by employing the applicable high-cost and time-consuming QM methodologies. It thus becomes straightforward to associate a wealth of novel molecular features with any DNA sequence, by scanning it with a matching k-meric window and pulling the pre-computed values from our database for further use in modelling. We demonstrate the usefulness of our deposited features through their exclusive use in developing a model for A->C mutation rates.
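The window-scanning lookup described above can be illustrated with a minimal sketch. The abstract does not specify the layout of the deposited database, so the in-memory dictionary `toy_table`, the feature name `feature_x`, its values, and the helper names `iter_kmers` and `annotate` are all hypothetical placeholders, not the authors' file format or API.

```python
from typing import Dict, Iterator, List, Optional, Tuple

K = 7  # window length, matching the 7-mers described in the abstract

# Hypothetical type: one dict of named feature values per k-mer.
FeatureTable = Dict[str, Dict[str, float]]


def iter_kmers(seq: str, k: int = K) -> Iterator[Tuple[int, str]]:
    """Yield (start_index, k-mer) for every full-length window in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def annotate(seq: str, table: FeatureTable,
             k: int = K) -> List[Optional[Dict[str, float]]]:
    """Return one feature dict per window, or None when the k-mer is
    absent from the table (for example, a window containing 'N')."""
    return [table.get(kmer) for _, kmer in iter_kmers(seq, k)]


# Toy stand-in for the pre-computed database (values are made up).
toy_table: FeatureTable = {
    "ACGTACG": {"feature_x": 0.1},
    "CGTACGT": {"feature_x": 0.2},
}

# A 9-base sequence gives 9 - 7 + 1 = 3 windows; the last contains 'N'.
features = annotate("acgtacgtn", toy_table)
```

In practice, `toy_table` would be replaced by whatever structure is loaded from the deposited files, and the per-window feature dicts could then be assembled into model inputs.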