Lee Hanbin, Craddock Rosalind Françoise, Gorjanc Gregor, Becher Hannes
Department of Statistics, University of Michigan, Ann Arbor, MI, 48109, USA.
The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, EH25 9RG, UK.
Genet Sel Evol. 2025 Aug 28;57(1):46. doi: 10.1186/s12711-025-00994-y.
Pedigrees continue to be extremely important in agriculture and conservation genetics, with the pedigrees of modern breeding programmes easily comprising millions of records. This size can make visualising the structure of such pedigrees challenging. Being graphs, pedigrees can be represented as matrices, most commonly the additive (numerator) relationship matrix, A, and its inverse. With these matrices, the structure of a pedigree can, in principle, be visualised via principal component analysis (PCA). However, naive PCA of the matrices for large pedigrees is challenging due to computational and memory constraints. Furthermore, computing a few leading principal components is usually sufficient for visualising the structure of a pedigree.
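To make the matrix representation concrete, the following is a minimal sketch of the classical tabular (recursive) method for building the additive relationship matrix from a pedigree. It is an illustration only and not code from randPedPCA; the function name `relationship_matrix`, the list-based pedigree encoding and the five-animal example are assumptions made here. The sketch stores the full dense matrix, which is exactly what becomes infeasible for pedigrees with millions of records.

```python
def relationship_matrix(sire, dam):
    """Dense additive relationship matrix via the tabular method.

    Assumes individuals are numbered 0..n-1 with parents listed before
    their offspring; an unknown parent is encoded as None.
    """
    n = len(sire)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        s, d = sire[i], dam[i]
        # Diagonal: 1 + inbreeding, where inbreeding is half the
        # relationship between the two parents (0 if a parent is unknown).
        both_known = s is not None and d is not None
        A[i][i] = 1.0 + (0.5 * A[s][d] if both_known else 0.0)
        # Off-diagonal: average of the relationships of j with i's parents.
        for j in range(i):
            a = 0.0
            if s is not None:
                a += 0.5 * A[j][s]
            if d is not None:
                a += 0.5 * A[j][d]
            A[i][j] = A[j][i] = a
    return A


# Hypothetical pedigree: 0 and 1 are founders, 2 and 3 are full sibs,
# and 4 is the offspring of the full-sib mating 2 x 3.
sire = [None, None, 0, 0, 2]
dam = [None, None, 1, 1, 3]
A = relationship_matrix(sire, dam)
print(A[4][4])  # 1.25, i.e. an inbreeding coefficient of 0.25
```

Memory grows with the square of the number of individuals, so a dense matrix of this kind is only practical for small pedigrees.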
We present the open-access R package randPedPCA for rapid pedigree PCA using sparse matrices. Our rapid pedigree PCA builds on the fact that matrix-vector multiplications with the additive relationship matrix can be carried out implicitly using the extremely sparse inverse relationship factor, L⁻¹, which can be directly obtained from a given pedigree. We implemented two methods. Randomised singular value decomposition tends to be faster when very few principal components are requested, and eigendecomposition via the RSpectra library tends to be faster when more principal components are of interest. On simulated data, our package delivers a speed-up greater than 10,000 times compared to naive PCA. It further enables analyses that are impossible with naive PCA. When only two principal components are desired, the randomised PCA method can halve the running time required compared to RSpectra, which we demonstrate by analysing the pedigree of the UK Kennel Club registered Labrador Retriever population of almost 1.5 million individuals.
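The implicit multiplication can be sketched as follows. The sketch assumes the standard pedigree decomposition A = (I - P)⁻¹ D (I - P)⁻ᵀ, where P holds 0.5 at each known parent position and D is the diagonal matrix of Mendelian sampling variances, so that the inverse factor is L⁻¹ = D^(-1/2) (I - P) with at most three non-zero entries per row. It is not the randPedPCA implementation: the function names, the list-based pedigree encoding and the supplied inbreeding coefficients are assumptions, and plain power iteration stands in for the randomised SVD and RSpectra solvers as the simplest method that needs only matrix-vector products. Centring and the computation of scores are not covered.

```python
import math


def pedigree_matvec(sire, dam, inbreeding, v):
    """Return A @ v without forming A, using two sparse triangular solves.

    Assumes parents are numbered before offspring, unknown parents are
    None, and inbreeding[i] is the inbreeding coefficient of individual i.
    """
    n = len(v)
    y = list(v)
    # Step 1: solve (I - P)^T y = v by back substitution; each individual
    # passes half of its value up to each known parent.
    for i in reversed(range(n)):
        for p in (sire[i], dam[i]):
            if p is not None:
                y[p] += 0.5 * y[i]
    # Step 2: scale by the Mendelian sampling variances (diagonal of D).
    for i in range(n):
        known = [p for p in (sire[i], dam[i]) if p is not None]
        y[i] *= 1.0 - sum(0.25 * (1.0 + inbreeding[p]) for p in known)
    # Step 3: solve (I - P) x = z by forward substitution; each individual
    # receives half of the finished value of each known parent.
    for i in range(n):
        for p in (sire[i], dam[i]):
            if p is not None:
                y[i] += 0.5 * y[p]
    return y


def leading_eigenpair(matvec, n, iterations=200):
    """Power iteration: a simple stand-in for an iterative eigensolver."""
    x = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        y = matvec(x)
        eigenvalue = math.sqrt(sum(c * c for c in y))
        x = [c / eigenvalue for c in y]
    return eigenvalue, x


# Hypothetical pedigree: 0 and 1 are founders, 2 and 3 are full sibs,
# and 4 is the inbred offspring of 2 x 3.
sire = [None, None, 0, 0, 2]
dam = [None, None, 1, 1, 3]
inbreeding = [0.0, 0.0, 0.0, 0.0, 0.25]


def apply_A(v):
    return pedigree_matvec(sire, dam, inbreeding, v)


# Multiplying by a unit vector recovers one column of A.
column_4 = apply_A([0.0, 0.0, 0.0, 0.0, 1.0])
print(column_4)  # [0.5, 0.5, 0.75, 0.75, 1.25]

eigenvalue, eigenvector = leading_eigenpair(apply_A, n=5)
```

Each product costs time proportional to the number of pedigree records rather than its square, which is the property that lets iterative solvers scale to very large pedigrees.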
The leading principal components of pedigree matrices can be efficiently obtained using randomised singular value decomposition and other methods. Scatter plots of the principal component scores allow for intuitive visualisation of large pedigrees. For large pedigrees, this is considerably faster than rendering plots of a pedigree graph.