Key Laboratory of Tianjin for Epigenetics, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China.
Key Laboratory of Phylogeny and Comparative Genomics of Jiangsu Province, Jiangsu Normal University, Xuzhou 221116, China.
Yi Chuan. 2022 Nov 20;44(11):1028-1043. doi: 10.16288/j.yczz.22-073.
Han Chinese, Korean and Japanese are the main populations of East Asia, and the Han Chinese show a north-to-south gradient of admixture. The East Asian populations differ in genetic structure. To achieve fine-scale genetic classification of southern (S-) and northern (N-) Han Chinese, Korean and Japanese individuals, we collected and analyzed 1185 ancestry informative SNPs (AISNPs) from previous reports and our own laboratory findings. First, two machine learning algorithms, softmax and randomForest, were used to build genetic classification models. Then phylogenetic trees, STRUCTURE and principal component analysis were used to evaluate the classification performance of different AISNP panels. The 234-AISNP panel achieved fine-scale differentiation among the target populations in four classification schemes. The softmax model reached an accuracy of 92% in classifying S-Han, N-Han, Korean and Japanese individuals. The two machine learning models tested in this study provide an important reference for high-resolution discrimination of closely related populations and will be useful tools for optimizing marker panels when developing forensic DNA ancestry inference systems.
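As a rough illustration of the softmax classification step named in the abstract, the following is a minimal sketch only and not the authors' pipeline. It assumes genotypes are coded as alternate-allele counts (0, 1, 2), simulates them from made-up allele frequencies for 20 hypothetical SNPs, and fits a plain multinomial (softmax) regression by batch gradient descent. The function names, the number of SNPs, the learning rate and the simulated data are all assumptions introduced for the example; the accuracy it reports is in-sample on synthetic data and has no relation to the 92% figure above.

```python
import math
import random

# Population labels taken from the abstract; everything else is hypothetical.
POPULATIONS = ["S-Han", "N-Han", "Korean", "Japanese"]


def simulate_genotypes(n_per_pop=40, n_snps=20, seed=0):
    """Simulate centered genotypes (-1, 0, 1) from made-up allele frequencies."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(len(POPULATIONS)):
        # Assumed per-population allele frequencies (not real data).
        freqs = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]
        for _ in range(n_per_pop):
            # Two allele draws per SNP give a count in {0, 1, 2}; subtract 1 to center.
            row = [(rng.random() < f) + (rng.random() < f) - 1 for f in freqs]
            X.append(row)
            y.append(label)
    return X, y


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def predict_proba(W, b, x):
    """Class probabilities for one individual under the linear softmax model."""
    logits = [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(W))]
    return softmax(logits)


def fit_softmax(X, y, n_classes, lr=0.05, epochs=150):
    """Fit multinomial logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * d for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, label in zip(X, y):
            p = predict_proba(W, b, x)
            for k in range(n_classes):
                # Gradient of cross-entropy w.r.t. logit k: p_k - 1[k == label].
                err = p[k] - (1.0 if k == label else 0.0)
                gb[k] += err
                for j in range(d):
                    gW[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * gb[k] / n
            for j in range(d):
                W[k][j] -= lr * gW[k][j] / n
    return W, b


def accuracy(W, b, X, y):
    """Fraction of individuals whose most probable class matches the label."""
    hits = 0
    for x, label in zip(X, y):
        p = predict_proba(W, b, x)
        hits += p.index(max(p)) == label
    return hits / len(y)


if __name__ == "__main__":
    X, y = simulate_genotypes()
    W, b = fit_softmax(X, y, len(POPULATIONS))
    print("in-sample accuracy on simulated data:", accuracy(W, b, X, y))
```

In practice one would evaluate on held-out individuals or by cross-validation, and a library implementation of multinomial logistic regression or random forests would normally replace the hand-written loop; the sketch only shows how genotype counts map to class probabilities.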