Department of Forensic Medicine, Guizhou Medical University, Guiyang, P. R. China.
Medical Genetics Institute of Henan Province, Henan Provincial People's Hospital, Zhengzhou University People's Hospital, Zhengzhou, P. R. China.
Electrophoresis. 2021 Aug;42(14-15):1473-1479. doi: 10.1002/elps.202100044. Epub 2021 May 19.
Extensive population data have been reported for the 30 deletion/insertion polymorphisms (DIPs) of the Investigator DIPplex kit in different continental populations. Here, we assessed the genetic distributions of these 30 DIPs in different continental populations to pinpoint candidate ancestry-informative DIPs. We also explored the effectiveness of machine learning methods for ancestry analysis. Pairwise informativeness (In) values of the 30 DIPs revealed that six loci displayed relatively high In values (>0.1) among different continental populations. In addition, more loci showed high population-specific divergence (PSD) values in the African population. Based on the pairwise In and PSD values of the 30 DIPs, 17 DIPs in the Investigator DIPplex kit were selected for ancestry analyses of African, European, and East Asian populations. Although the full set of 30 DIPs provided better ancestry resolution of these continental populations according to PCA and population genetic structure analyses, the 17 DIPs could also distinguish these populations. More importantly, the 17 DIPs showed more balanced cumulative PSD distributions across these populations. Six machine learning methods were used to perform ancestry analyses of these continental populations based on the 17 DIPs. Naïve Bayes showed the best performance, whereas k-nearest neighbor showed relatively low performance. In summary, these machine learning methods, especially naïve Bayes, could serve as valuable tools for ancestry analysis.
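The pairwise In screening described above can be illustrated with a minimal sketch. It assumes that In refers to Rosenberg's informativeness for assignment computed with natural logarithms, which the abstract does not state explicitly. The function names and the allele frequencies in the usage lines are hypothetical and are not taken from the study.

```python
import math


def informativeness(freqs_by_pop):
    """Informativeness for assignment (In) at one locus.

    Assumed definition (Rosenberg-style), with K populations and alleles j:
        In = sum_j [ -p_j * ln(p_j) + sum_i (p_ij / K) * ln(p_ij) ]
    where p_ij is the frequency of allele j in population i and p_j is the
    mean of p_ij over populations.

    freqs_by_pop: list of per-population allele-frequency lists,
    each of the same length and summing to 1.
    """
    k = len(freqs_by_pop)
    n_alleles = len(freqs_by_pop[0])
    total = 0.0
    for j in range(n_alleles):
        p_bar = sum(pop[j] for pop in freqs_by_pop) / k
        if p_bar > 0:
            total -= p_bar * math.log(p_bar)
        for pop in freqs_by_pop:
            if pop[j] > 0:  # treat 0 * ln(0) as 0
                total += (pop[j] / k) * math.log(pop[j])
    return total


def pairwise_in(ins_freq_a, ins_freq_b):
    """Pairwise In for a biallelic DIP, given the insertion-allele
    frequency in each of two populations."""
    return informativeness([
        [ins_freq_a, 1.0 - ins_freq_a],
        [ins_freq_b, 1.0 - ins_freq_b],
    ])


# Hypothetical frequencies, for illustration only.
value = pairwise_in(0.9, 0.2)
is_candidate = value > 0.1  # the >0.1 cutoff is the one quoted in the abstract
```

Under this definition In is 0 when two populations share identical allele frequencies and reaches ln 2 for a biallelic locus fixed for different alleles in the two populations, so the 0.1 cutoff sits on a scale from 0 to about 0.69.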