Hudson Nicholas J, Porto-Neto Laercio, Kijas James W, Reverter Antonio
CSIRO Agriculture, Computational and Systems Biology, 306 Carmody Road, St. Lucia, Brisbane, QLD, 4075, Australia.
Genet Sel Evol. 2015 Oct 13;47:78. doi: 10.1186/s12711-015-0158-9.
Genetic relatedness is currently estimated by a combination of traditional pedigree-based approaches (i.e. numerator relationship matrices, NRM) and, given the recent availability of molecular information, marker-based approaches (via genomic relationship matrices, GRM). To date, GRM are computed from genome-wide pair-wise SNP (single nucleotide polymorphism) correlations.
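To make the idea of a correlation-based GRM concrete, the following is a minimal sketch, not the authors' implementation. It assumes genotypes are coded as 0/1/2 allele counts per SNP. The function names (`pearson`, `correlation_grm`) and the toy animals are hypothetical. Production GRM software typically also centres and scales genotypes by allele frequency, which is omitted here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length genotype vectors.

    Assumes neither vector is constant (a constant vector has zero
    variance and would cause a division by zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_grm(genotypes):
    """Build a pair-wise correlation matrix as a nested dict keyed by animal id."""
    ids = list(genotypes)
    return {a: {b: pearson(genotypes[a], genotypes[b]) for b in ids} for a in ids}


# Hypothetical toy data: three animals, six SNPs, coded as 0/1/2 allele counts.
toy_genotypes = {
    "a": [0, 1, 2, 1, 0, 2],
    "b": [0, 1, 2, 1, 1, 2],  # differs from "a" at one SNP
    "c": [2, 1, 0, 1, 2, 0],  # mirror image of "a" (2 minus each count)
}

grm = correlation_grm(toy_genotypes)
```

In this toy example, animals "a" and "b" are strongly positively correlated, while "a" and "c" are perfectly negatively correlated because "c" carries the opposite allele count at every SNP.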
We describe a new estimate of genetic relatedness using the concept of normalised compression distance (NCD) that is borrowed from Information Theory. Analogous to GRM, the resultant compression relationship matrix (CRM) exploits numerical patterns in genome-wide allele order and proportion, which are known to vary systematically with relatedness. We explored properties of the CRM in two industry cattle datasets by analysing the genetic basis of yearling weight, a phenotype of moderate heritability. In both Brahman (Bos indicus) and Tropical Composite (Bos taurus by Bos indicus) populations, the clustering inferred by NCD was comparable to that based on SNP correlations using standard principal component analysis approaches. One of the versions of the CRM modestly increased the amount of explained genetic variance, slightly reduced the 'missing heritability' and tended to improve the prediction accuracy of breeding values in both populations when compared to both NRM and GRM. Finally, a sliding window-based application of the compression approach on these populations identified genomic regions influenced by introgression of taurine haplotypes.
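The normalised compression distance can be illustrated with a short sketch. It uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. The choice of `zlib` as the compressor, the encoding of genotypes as a string of 0/1/2 characters, and all function names and simulated data are assumptions made for illustration only. The abstract does not specify the compressor or encoding the authors used, nor how distances are converted into the CRM.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def encode(genotype) -> bytes:
    """Assumed encoding: one ASCII character ('0', '1' or '2') per SNP."""
    return "".join(str(g) for g in genotype).encode("ascii")


def ncd_matrix(genotypes):
    """Pair-wise NCD as a nested dict keyed by animal id."""
    encoded = {k: encode(v) for k, v in genotypes.items()}
    return {a: {b: ncd(encoded[a], encoded[b]) for b in encoded} for a in encoded}


# Simulated data (hypothetical): 2000 SNPs per animal, fixed seed for repeatability.
rng = random.Random(0)
parent = [rng.choice([0, 1, 2]) for _ in range(2000)]

# A "relative" that shares the parent's genotype except at 20 random SNPs.
relative = list(parent)
for pos in rng.sample(range(2000), 20):
    relative[pos] = (relative[pos] + 1) % 3

# An "unrelated" animal with independently drawn genotypes.
unrelated = [rng.choice([0, 1, 2]) for _ in range(2000)]

distances = ncd_matrix(
    {"parent": parent, "relative": relative, "unrelated": unrelated}
)
```

Because the compressor can reuse long stretches of the parent's sequence when encoding the relative, the parent-relative distance comes out smaller than the parent-unrelated distance. This is the property that lets shared allele order and proportion act as a signal of relatedness.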
For these two bovine populations, CRM reduced the missing heritability and increased the amount of explained genetic variation for a moderately heritable complex trait. Given that NCD can sensitively discriminate closely related individuals, we foresee CRM having possible value for estimating breeding values in highly inbred populations.