Random forest estimation of genomic breeding values for disease susceptibility over different disease incidences and genomic architectures in simulated cow calibration groups.

Author information

Naderi S, Yin T, König S

Affiliations

Department of Animal Breeding, University of Kassel, 37213 Witzenhausen, Germany.

Publication information

J Dairy Sci. 2016 Sep;99(9):7261-7273. doi: 10.3168/jds.2016-10887. Epub 2016 Jun 22.

Abstract

A simulation study was conducted to investigate the performance of random forest (RF) and genomic BLUP (GBLUP) for genomic predictions of binary disease traits based on cow calibration groups. Training and testing sets were modified in different scenarios according to disease incidence, the quantitative-genetic background of the trait (h² = 0.30 and h² = 0.10), and the genomic architecture [725 quantitative trait loci (QTL) and 290 QTL, populations with high and low levels of linkage disequilibrium (LD)]. For all scenarios, 10,005 SNP (depicting a low-density 10K SNP chip) and 50,025 SNP (depicting a 50K SNP chip) were evenly spaced along 29 chromosomes. Training and testing sets included 20,000 cows (4,000 sick, 16,000 healthy, disease incidence 20%) from the last 2 generations. Initially, 4,000 sick cows were assigned to the testing set, and the remaining 16,000 healthy cows represented the training set. In the subsequent allocation schemes, the number of sick cows in the training set increased stepwise by moving 10% of the sick animals from the testing set to the training set, and vice versa. The size of the training and testing sets was kept constant. Evaluation criteria for both GBLUP and RF were the correlations between genomic breeding values and true breeding values (prediction accuracy), and the area under the receiver operating characteristic curve (AUROC). Prediction accuracy and AUROC increased for both methods and all scenarios as increasing percentages of sick cows were allocated to the training set. The highest prediction accuracies were observed for disease incidences in training sets that reflected the population disease incidence of 0.20. For this allocation scheme, the largest prediction accuracies of 0.53 for RF and of 0.51 for GBLUP, and the largest AUROC of 0.66 for RF and of 0.64 for GBLUP, were achieved using 50,025 SNP, a heritability of 0.30, and 725 QTL. A decrease in heritability from 0.30 to 0.10 and a reduction in the number of QTL from 725 to 290 were associated with decreasing prediction accuracy and decreasing AUROC for all scenarios. This decrease was more pronounced for RF. Also, the increase in LD had a stronger effect on RF results than on GBLUP results. The highest prediction accuracy from the low LD scenario was 0.30 from RF and 0.36 from GBLUP, and increased to 0.39 for both methods in the high LD population. Random forest successfully identified important SNP in close map distance to QTL that explained a high proportion of the phenotypic trait variation.
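The stepwise allocation described in the abstract can be worked through as simple arithmetic. The sketch below is illustrative only and is not the authors' code. It assumes that "vice versa" means an equal number of healthy cows moves from the training set to the testing set at each step, which is what keeps both set sizes constant. The function name `allocation_schemes` and the dictionary keys are hypothetical.

```python
def allocation_schemes(n_sick=4000, n_healthy=16000, n_train=16000):
    """Tabulate training/testing composition for each 10% allocation step.

    Assumption (not stated explicitly in the abstract): every sick cow moved
    into the training set is exchanged for one healthy cow moved into the
    testing set, so both set sizes stay constant.
    """
    rows = []
    for pct in range(0, 101, 10):
        sick_train = n_sick * pct // 100           # sick cows moved to training
        healthy_train = n_train - sick_train       # healthy cows left in training
        rows.append({
            "pct_sick_moved": pct,
            "sick_train": sick_train,
            "healthy_train": healthy_train,
            "sick_test": n_sick - sick_train,
            "healthy_test": n_healthy - healthy_train,
            "train_incidence": sick_train / n_train,
        })
    return rows


if __name__ == "__main__":
    for row in allocation_schemes():
        print(row["pct_sick_moved"], row["sick_train"],
              row["healthy_train"], round(row["train_incidence"], 3))
```

Under this reading, the training-set incidence equals the population incidence of 0.20 when 80% of the sick cows (3,200 of 16,000 training animals) have been moved, the allocation for which the abstract reports the highest prediction accuracies.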

