UMR 7268 ADES, CNRS, Aix-Marseille Université, EFS, Faculté de Médecine Timone, Marseille 13005, France.
School of Mathematics and Statistics, Faculty of Science, Technology, Engineering and Mathematics, The Open University, Milton Keynes MK7 6AA, UK; Department of Genetics, Evolution and Environment, and UCL Genetics Institute, University College London, London WC1E 6BT, UK.
Forensic Sci Int Genet. 2021 Jul;53:102517. doi: 10.1016/j.fsigen.2021.102517. Epub 2021 Apr 6.
Here we evaluate the accuracy of prediction for eye, hair and skin pigmentation in a dataset of >6,500 individuals from Mexico, Colombia, Peru, Chile and Brazil (the CANDELA dataset, CAN, which includes genome-wide SNP data and both quantitative and categorical pigmentation phenotypes). We evaluated accuracy in relation to different analytical methods and various phenotypic predictors. As expected from statistical principles, we observe that quantitative traits are more sensitive to changes in the prediction models than categorical traits. We find that Random Forest or Linear Regression are generally the best-performing methods. We also compare the prediction accuracy of SNP sets defined in the CAN dataset (including 56, 101 and 120 SNPs for eye, hair and skin colour prediction, respectively) to the well-established HIrisPlex-S SNP set (including 6, 22 and 36 SNPs for eye, hair and skin colour prediction, respectively). When training prediction models on the CAN data, we observe remarkably similar performance for HIrisPlex-S and the larger CAN SNP sets in the prediction of hair colour (categorical) and eye colour (both categorical and quantitative), while the CAN sets outperform HIrisPlex-S for quantitative, but not categorical, skin pigmentation prediction. The performance of HIrisPlex-S when models are trained in a world-wide sample (although one consisting of 80% Europeans, https://hirisplex.erasmusmc.nl) is lower than when they are trained in the CAN data, particularly for hair and skin colour. Altogether, our observations are consistent with common variation of eye and hair colour having a relatively simple genetic architecture, which is well captured by HIrisPlex-S, even in admixed Latin Americans (with partial European ancestry). By contrast, since skin pigmentation is a more polygenic trait, accuracy is more sensitive to the size of the prediction SNP set, although here this effect was only apparent for a quantitative measure of skin pigmentation.
Our results support the use of HIrisPlex-S in the prediction of categorical pigmentation traits for forensic purposes in Latin America, while illustrating the impact of training datasets on its accuracy.
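The contrast between quantitative and categorical accuracy as the SNP set grows can be illustrated with a toy simulation. The sketch below is not the authors' analysis and uses no CANDELA or HIrisPlex-S data. It assumes a hypothetical additive trait built from 40 equally weighted simulated SNPs plus Gaussian noise, and it assumes two illustrative metrics: squared Pearson correlation for the quantitative trait and above/below-median agreement for a two-class version of the same trait. All function names and parameter values are invented for illustration.

```python
import random
import statistics


def simulate(n_people=2000, n_snps=40, noise_sd=2.0, seed=0):
    """Toy data (assumption): genotypes coded 0/1/2, trait = sum of genotypes + noise."""
    rng = random.Random(seed)
    genotypes = [[rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_snps)]
                 for _ in range(n_people)]
    trait = [sum(g) + rng.gauss(0.0, noise_sd) for g in genotypes]
    return genotypes, trait


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def binary_accuracy(score, trait):
    """Share of people whose above/below-median call agrees for score and trait."""
    s_cut, t_cut = statistics.median(score), statistics.median(trait)
    hits = sum((s > s_cut) == (t > t_cut) for s, t in zip(score, trait))
    return hits / len(trait)


def evaluate(n_used, genotypes, trait):
    """Score each person with the first n_used SNPs, then compute both metrics."""
    score = [sum(g[:n_used]) for g in genotypes]
    return r_squared(score, trait), binary_accuracy(score, trait)


genotypes, trait = simulate()
for n_used in (5, 20, 40):
    r2, acc = evaluate(n_used, genotypes, trait)
    print(f"{n_used:>2} SNPs: R^2 = {r2:.2f}, two-class accuracy = {acc:.2f}")
```

In this toy setting both metrics improve as more SNPs are used. The example only shows how the two kinds of accuracy can be computed side by side for a polygenic trait; it says nothing about the magnitude of the effects reported for the real data.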