Sylvester Emma V A, Bentzen Paul, Bradbury Ian R, Clément Marie, Pearce Jon, Horne John, Beiko Robert G
Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada.
Marine Gene Probe Laboratory, Department of Biology, Dalhousie University, Halifax, NS, Canada.
Evol Appl. 2017 Sep 14;11(2):153-165. doi: 10.1111/eva.12524. eCollection 2018 Feb.
Genetic population assignment used to inform wildlife management and conservation efforts requires panels of highly informative genetic markers and sensitive assignment tests. We explored the utility of machine-learning algorithms (random forest, regularized random forest and guided regularized random forest) compared with FST ranking for selection of single nucleotide polymorphisms (SNPs) for fine-scale population assignment. We applied these methods to an unpublished SNP data set for Atlantic salmon (Salmo salar) and a published SNP data set for Alaskan Chinook salmon (Oncorhynchus tshawytscha). In each species, we identified the minimum panel size required to obtain a self-assignment accuracy of at least 90% using each method to create panels of 50-700 markers. Panels of SNPs identified using random forest-based methods performed up to 7.8 and 11.2 percentage points better than FST-selected panels of similar size for the Atlantic salmon and Chinook salmon data, respectively. Self-assignment accuracy ≥90% was obtained with panels of 670 and 384 SNPs for each data set, respectively, a level of accuracy never reached for these species using FST-selected panels. Our results demonstrate a role for machine-learning approaches in marker selection across large genomic data sets to improve assignment for management and conservation of exploited populations.
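The panel-size search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes the ranking baseline is a per-locus FST ranking in Wright's variance-based form, that genotypes are coded as 0, 1 or 2 copies of a reference allele, and that a leave-one-out nearest-centroid classifier stands in for the assignment test. All function names and the toy data are hypothetical.

```python
def fst(freqs):
    """Wright's F_ST for one biallelic locus: var(p) / (p_bar * (1 - p_bar))."""
    n = len(freqs)
    p_bar = sum(freqs) / n
    if p_bar in (0.0, 1.0):
        return 0.0  # monomorphic locus: no differentiation to measure
    var_p = sum((p - p_bar) ** 2 for p in freqs) / n
    return var_p / (p_bar * (1.0 - p_bar))


def rank_markers(genotypes, labels):
    """Return marker indices ordered from highest to lowest F_ST."""
    pops = sorted(set(labels))
    scores = []
    for j in range(len(genotypes[0])):
        freqs = []
        for pop in pops:
            counts = [g[j] for g, lab in zip(genotypes, labels) if lab == pop]
            freqs.append(sum(counts) / (2.0 * len(counts)))  # diploid: 2 alleles each
        scores.append(fst(freqs))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def self_assignment_accuracy(genotypes, labels, panel):
    """Leave-one-out accuracy of nearest-centroid assignment using only `panel`."""
    pops = sorted(set(labels))
    correct = 0
    for i, (g, true_pop) in enumerate(zip(genotypes, labels)):
        best_pop, best_dist = None, float("inf")
        for pop in pops:
            # Exclude the individual being assigned from its own centroid.
            members = [h for k, (h, lab) in enumerate(zip(genotypes, labels))
                       if lab == pop and k != i]
            if not members:
                continue
            dist = sum((g[j] - sum(h[j] for h in members) / len(members)) ** 2
                       for j in panel)
            if dist < best_dist:
                best_pop, best_dist = pop, dist
        correct += best_pop == true_pop
    return correct / len(genotypes)


def minimum_panel_size(genotypes, labels, ranking, sizes, target=0.90):
    """Smallest panel size whose top-ranked markers reach the target accuracy."""
    for size in sorted(sizes):
        if self_assignment_accuracy(genotypes, labels, ranking[:size]) >= target:
            return size
    return None  # target never reached with the sizes tried


# Toy data (hypothetical): 6 individuals, 2 populations, 3 SNPs.
toy_genotypes = [[2, 1, 0], [2, 1, 1], [2, 0, 0],
                 [0, 1, 0], [0, 1, 1], [0, 0, 1]]
toy_labels = ["A", "A", "A", "B", "B", "B"]
toy_ranking = rank_markers(toy_genotypes, toy_labels)
toy_min_size = minimum_panel_size(toy_genotypes, toy_labels, toy_ranking, [1, 2, 3])
```

Swapping `rank_markers` for a ranking derived from random forest variable importance would give the comparison the abstract describes. Ranking and scoring on the same individuals, as this sketch does, can inflate accuracy, so a held-out set would be the safer design in practice.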