Department of Chemistry, University of Turin, Turin, Italy.
Centro Regionale Antidoping e di Tossicologia "A. Bertinaria", Orbassano, Torino, Italy.
Sci Rep. 2022 May 28;12(1):8974. doi: 10.1038/s41598-022-12903-0.
The biogeographical ancestry (BGA) of a trace, or of a person or skeleton, is the biologically determined component of ethnicity, which as a whole comprises both biological and cultural elements. Many individuals are now interested in exploring their genealogy, and the ability to distinguish population groups and subgroups biogeographically through DNA analysis plays an essential role in several fields, including forensics. For investigative and intelligence purposes, it is useful to infer the biogeographical origin of perpetrators of crimes, or of victims in unsolved cold cases, when no reference profile or database hit is available for comparison. Current approaches to BGA estimation from SNP data are usually based on principal component analysis (PCA) and the Structure software. The present study provides an alternative method that uses multivariate data analysis and machine learning strategies to evaluate the BGA discriminating power for unknown samples using different commercial panels. Starting from the 1000 Genomes Project, Simons Genome Diversity Project and Human Genome Diversity Project datasets, which include African, American, Asian, European and Oceanian individuals, and moving towards more geographically restricted populations, multivariate techniques such as Partial Least Squares-Discriminant Analysis (PLS-DA) and machine learning techniques such as XGBoost were employed, and their discriminating power was compared. PLS-DA provided more robust classifications than XGBoost. This suggests that the adopted approach could be a useful tool for forensic experts to infer BGA information from the DNA profile of unknown individuals, while also highlighting that commercial forensic panels may be inadequate for discriminating populations at the intra-continental level.
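To make the PLS-DA idea concrete, the following is a minimal sketch, not the authors' pipeline. It assumes two hypothetical populations, synthetic genotypes coded as 0/1/2 copies of an allele, and invented allele frequencies; the function names `fit_pls1` and `predict_population` are illustrative only. PLS-DA is implemented here as a PLS1 regression (NIPALS) onto a 0/1 class label, with a 0.5 threshold on the predicted score. It shows the two-class case only, not the multi-population comparison or the XGBoost model described above.

```python
import numpy as np


def fit_pls1(X, y, n_components=2):
    """Fit a PLS1 regression with NIPALS; returns (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean          # centered working copies
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                        # weight: covariance of SNPs with label
        w /= np.linalg.norm(w)
        t = Xr @ w                           # sample scores on this component
        tt = t @ t
        p = Xr.T @ t / tt                    # X loadings
        qa = yr @ t / tt                     # y loading
        Xr = Xr - np.outer(t, p)             # deflate X and y
        yr = yr - qa * t
        W.append(w)
        P.append(p)
        q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector in SNP space
    return x_mean, y_mean, coef


def predict_population(X, x_mean, y_mean, coef):
    """Assign population 1 when the predicted score exceeds 0.5, else 0."""
    score = (X - x_mean) @ coef + y_mean
    return (score > 0.5).astype(int)


# Synthetic data (assumption): 10 informative SNPs and 10 uninformative ones.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 30, 20
freq_a = np.concatenate([np.full(10, 0.1), np.full(10, 0.5)])
freq_b = np.concatenate([np.full(10, 0.9), np.full(10, 0.5)])
geno_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
geno_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))

X = np.vstack([geno_a, geno_b]).astype(float)
y = np.repeat([0.0, 1.0], n_per_pop)

# Hold out every second sample to mimic classifying "unknown" profiles.
train = np.arange(len(y)) % 2 == 0
x_mean, y_mean, coef = fit_pls1(X[train], y[train], n_components=2)
predicted = predict_population(X[~train], x_mean, y_mean, coef)
accuracy = float(np.mean(predicted == y[~train].astype(int)))
print(f"held-out accuracy: {accuracy:.2f}")
```

With allele frequencies this far apart the two groups separate easily. The intra-continental case discussed in the abstract corresponds to much closer frequencies, where a small panel of SNPs would be expected to carry far less discriminating signal.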