Qian Yu, Peng Qianqian, Qian Qili, Gao Xingjian, Liu Xinxuan, Li Yi, Fan Xiu, Cheng Yuan, Yuan Na, Hadi Sibte, Jin Li, Wang Sijia, Liu Fan
Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, China National Center for Bioinformation, Chinese Academy of Sciences, Beijing, China.
Beijing No.8 High School, Beijing, China.
Int J Legal Med. 2025 May;139(3):1193-1203. doi: 10.1007/s00414-024-03365-2. Epub 2024 Dec 5.
Estimating an individual's age from DNA methylation at age-associated CpG sites may provide key information for forensic investigations. Systematic marker screening and feature selection are critical to the performance of the final prediction model. In the discovery stage, we screened 811,876 CpGs in whole blood from 2,664 Chinese individuals aged 18 to 83 years using a stepwise conditional epigenome-wide association study (SCEWAS). The SCEWAS identified 28 CpGs showing genome-wide significant and independent effects. Restricting this panel to the 10 most informative CpGs resulted in a tolerable loss of information. A linear model consisting of these 10 CpGs explained 93% of the age variance (R² = 0.93) in the training set (n = 2,664). In an independent test set of Chinese individuals (n = 648), this model also provided highly accurate predictions (R² = 0.85; mean absolute deviation, MAD = 3.20 years). The model was additionally validated in a public dataset of multiple ancestral origins (86 Europeans, 14 Asians, and 273 Africans), where the prediction accuracy was significantly reduced (R² = 0.85, MAD = 6.21 years), as might be expected given the different genomic backgrounds, sample sizes, and age ranges. Our 10-CpG model also outperformed the recently proposed 9-CpG model constructed in 390 Chinese males (R² = 0.79 in the test set). We also demonstrated that our SCEWAS approach outperformed the traditional EWAS and the elastic net approach in obtaining a small set of the most age-informative CpGs. Overall, our systematic genome-wide feature selection identified a small panel of 10 CpGs for accurate age estimation with high potential in forensic applications.
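The selection-then-regression idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a forward stepwise procedure in which, at each step, the CpG with the strongest partial correlation with age (conditional on the CpGs already chosen) is added, and it stops after a fixed number of sites rather than applying a genome-wide significance threshold. All function names, the synthetic data, and the site indices are hypothetical and chosen only for illustration.

```python
import numpy as np


def residualize(target, covariates):
    """Residuals of target (n,) or (n, m) after OLS on covariates (n, k)."""
    coef, *_ = np.linalg.lstsq(covariates, target, rcond=None)
    return target - covariates @ coef


def stepwise_conditional_select(beta, age, n_sites):
    """Forward selection: at each step pick the site with the largest
    absolute partial correlation with age, given the sites already selected.
    Assumption: a fixed n_sites replaces a significance-based stopping rule."""
    n = beta.shape[0]
    selected = []
    for _ in range(n_sites):
        # Intercept plus the methylation values of already selected sites.
        Z = np.column_stack([np.ones(n)] + [beta[:, j] for j in selected])
        age_res = residualize(age, Z)
        beta_res = residualize(beta, Z)
        num = beta_res.T @ age_res
        denom = np.linalg.norm(beta_res, axis=0) * np.linalg.norm(age_res)
        with np.errstate(divide="ignore", invalid="ignore"):
            partial_r = np.where(denom > 1e-12, num / denom, 0.0)
        if selected:
            partial_r[selected] = 0.0  # never re-select a site
        selected.append(int(np.argmax(np.abs(partial_r))))
    return selected


def fit_linear(beta, age):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    Z = np.column_stack([np.ones(len(age)), beta])
    coef, *_ = np.linalg.lstsq(Z, age, rcond=None)
    return coef


def predict(coef, beta):
    return coef[0] + beta @ coef[1:]


def r_squared(y, y_hat):
    """Proportion of variance in y explained by y_hat."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)


def mean_abs_dev(y, y_hat):
    """Mean absolute deviation between chronological and predicted age."""
    return float(np.mean(np.abs(y - y_hat)))


# Synthetic data (hypothetical): 600 samples, 200 CpGs, three age-informative
# sites (5, 17, 42) and one site (99) that is a noisy copy of site 5.
rng = np.random.default_rng(0)
n, m = 600, 200
age = rng.uniform(18, 83, size=n)
beta = rng.uniform(0.2, 0.8, size=(n, m))
beta[:, 5] = 0.30 + 0.004 * age + rng.normal(0, 0.03, n)
beta[:, 17] = 0.80 - 0.003 * age + rng.normal(0, 0.03, n)
beta[:, 42] = 0.40 + 0.002 * age + rng.normal(0, 0.03, n)
beta[:, 99] = beta[:, 5] + rng.normal(0, 0.03, n)

train, test = slice(0, 400), slice(400, None)
selected = stepwise_conditional_select(beta[train], age[train], n_sites=3)
coef = fit_linear(beta[train][:, selected], age[train])
pred = predict(coef, beta[test][:, selected])
r2 = r_squared(age[test], pred)
mad = mean_abs_dev(age[test], pred)
print("selected sites:", selected)
print("test R^2:", round(r2, 3), "test MAD (years):", round(mad, 2))
```

In this toy setting, site 99 is strongly associated with age on its own but carries little information once site 5 is in the model, which is the kind of redundancy a conditional analysis is meant to filter out. A marginal (unconditional) ranking would not make that distinction.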