Degen Bernd, Yanbaev Yulai, Müller Niels A
Thünen Institute of Forest Genetics, Grosshansdorf, Germany.
Bashkir State Agrarian University, Ufa, Russia.
PLoS One. 2025 Jun 6;20(6):e0324994. doi: 10.1371/journal.pone.0324994. eCollection 2025.
Origin tracking is important for ensuring that the right seed source is used and that traded timber has been legally harvested. It can also help to reconstruct historical human-caused long-distance seed transfer and to detect mislabelling in forest field trials. So far, genetic assignment approaches have mostly been discrete, assigning test samples to predefined groups. The main limitation of this approach is that such discrete groups are hard to justify when genetic variation across the landscape is actually continuous. Here, we compare the accuracy of five continuous assignment methods. Specifically, we test a nearest neighbour method (NN), direct Gaussian process regression (GPR-D) using the radial basis kernel function, grid-based Gaussian process regression (GPR-G) applying the Matérn kernel function, genomic prediction (GP) and deep learning (DL), using two genome-wide single nucleotide polymorphism (SNP) datasets of trees from across Europe. The first dataset comprises 30,000 SNPs from 865 European beech (Fagus sylvatica) trees; the second consists of 381 SNPs from 1,883 pedunculate oak (Quercus robur) trees. Accuracy, measured as the geographic distance between true and predicted locations, was highest for the GPR-G and DL methods on the beech dataset, with median distances of only 55 km and 76 km, respectively. For the oak data, GPR-G and DL also performed best, with median distances of 263 km and 278 km, respectively. For both datasets, the relative error (distance divided by the maximum distance among tree pairs) of the best method was below 8% for 90% of all samples. We detected 35 individuals and 10 groups as outliers in the beech data and 27 individuals and 18 groups in the oak data. These outliers may be caused by mislabelling or by historical human-caused long-distance seed transfer. We discuss the differences in performance among the approaches and highlight future applications and potential for further improvements.
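The accuracy measure described in the abstract (geographic distance between true and predicted locations, and that distance divided by the maximum distance among tree pairs) can be illustrated with a minimal Python sketch, together with a simple nearest neighbour assignment. This is not the authors' implementation. The abstract does not specify the genetic distance or the geographic distance formula, so the mean absolute allele-count difference and the haversine great-circle distance used here are assumptions, as are all function names and the toy data in the tests.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption for this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def genotype_distance(g1, g2):
    """Mean absolute difference in allele counts (0, 1, 2) over shared SNPs.

    Missing genotypes are encoded as None and skipped. This distance is a
    stand-in; the abstract does not say which one the NN method uses.
    """
    pairs = [(x, y) for x, y in zip(g1, g2) if x is not None and y is not None]
    if not pairs:
        return math.inf
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


def nearest_neighbour_location(test_genotype, ref_genotypes, ref_locations):
    """Predict a location as that of the genetically closest reference tree."""
    best = min(
        range(len(ref_genotypes)),
        key=lambda i: genotype_distance(test_genotype, ref_genotypes[i]),
    )
    return ref_locations[best]


def relative_error(true_loc, pred_loc, all_locations):
    """Assignment distance divided by the largest distance among tree pairs.

    Requires at least two distinct locations in all_locations.
    """
    max_dist = max(
        haversine_km(p, q)
        for i, p in enumerate(all_locations)
        for q in all_locations[i + 1:]
    )
    return haversine_km(true_loc, pred_loc) / max_dist
```

With a reference panel of genotypes and coordinates, `nearest_neighbour_location` returns the predicted origin of a test sample, and `relative_error` expresses the resulting error on the scale of the whole sampling area, which makes results from datasets with different geographic extents comparable.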