King's Forensics, Department of Analytical, Environmental and Forensic Sciences, Faculty of Life Sciences and Medicine, King's College London, London, UK.
Methods Mol Biol. 2022;2432:187-200. doi: 10.1007/978-1-0716-1994-0_14.
Recent studies using epigenetic data have explored whether a person's age can be estimated from their DNA alone. This application stems from the strong correlation observed in humans between the methylation status of certain DNA loci and chronological age. While genome-wide methylation sequencing has been the most prominent approach in epigenetics research, recent studies have shown that targeted sequencing of a limited number of loci can be used successfully to estimate chronological age from DNA samples, even with small datasets. Following this shift, multiple studies have identified the need to investigate further the statistics underlying the predictive models used for DNA methylation-based prediction. This chapter presents an example of basic data manipulation and modeling that can be applied to small DNA methylation datasets (100-400 samples) produced through targeted methylation sequencing for a small number of predictors (10-25 methylation sites). Data manipulation focuses on converting the methylation values obtained for the different predictors into a statistically meaningful dataset, followed by a basic introduction to importing such datasets into R and to randomizing and splitting them into appropriate training and test sets for modeling. Finally, a basic introduction to modeling in R is outlined, starting with feature selection algorithms and continuing with a simple modeling example (a linear model) as well as a more complex algorithm (a Support Vector Machine).
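The workflow summarized above (import, randomize, split, select features, fit a linear model and a Support Vector Machine) could be sketched in R roughly as follows. This is a minimal illustration under stated assumptions, not the chapter's own code: the data are simulated rather than real, and the file name `methylation.csv`, the 70/30 split, the correlation threshold of 0.3, and the use of the `e1071` package for the SVM are all choices made only for this example.

```r
# Synthetic stand-in for a targeted methylation dataset (NOT real data):
# 200 samples, 12 CpG sites, methylation expressed as proportions.
# With real data one would import instead, e.g.
#   dat <- read.csv("methylation.csv")   # hypothetical file name
set.seed(42)
n_samples <- 200
n_sites <- 12
age <- runif(n_samples, min = 18, max = 80)

# Hypothetical per-site slopes; the last six sites carry no age signal.
slopes <- c(0.004, -0.003, 0.005, -0.004, 0.003, 0.002, rep(0, 6))
meth <- sapply(slopes, function(b) {
  0.5 + b * (age - 50) + rnorm(n_samples, sd = 0.03)
})
colnames(meth) <- paste0("CpG", seq_len(n_sites))
dat <- data.frame(age = age, meth)

# Randomize the row order, then split into training and test sets
# (70/30 is an assumed ratio).
shuffled <- dat[sample(nrow(dat)), ]
n_train <- floor(0.7 * nrow(shuffled))
train <- shuffled[seq_len(n_train), ]
test <- shuffled[-seq_len(n_train), ]

# Simple filter-style feature selection on the training set only:
# keep sites whose absolute correlation with age exceeds an assumed threshold.
cors <- sapply(train[, -1], function(x) cor(x, train$age))
selected <- names(cors)[abs(cors) > 0.3]

# Linear model on the selected sites, evaluated by mean absolute error (MAE)
# on the held-out test set.
model_formula <- reformulate(selected, response = "age")
lm_fit <- lm(model_formula, data = train)
lm_pred <- predict(lm_fit, newdata = test)
mae_lm <- mean(abs(lm_pred - test$age))

# Support Vector Machine regression, run only if e1071 is installed
# (default settings: eps-regression with a radial kernel).
if (requireNamespace("e1071", quietly = TRUE)) {
  svm_fit <- e1071::svm(model_formula, data = train)
  svm_pred <- predict(svm_fit, newdata = test)
  mae_svm <- mean(abs(svm_pred - test$age))
}
```

Selecting features on the training set alone keeps the test set untouched until evaluation, so the test-set error remains an honest estimate of how the model performs on unseen samples.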