Animal Improvement Programs Laboratory, ARS, USDA, Beltsville, MD 20705-2350, USA.
J Anim Sci. 2012 Mar;90(3):723-33. doi: 10.2527/jas.2011-4584. Epub 2011 Nov 18.
Modern animal breeding data sets are large and getting larger, due in part to recent availability of high-density SNP arrays and cheap sequencing technology. High-performance computing methods for efficient data warehousing and analysis are under development. Financial and security considerations are important when using shared clusters. Sound software engineering practices are needed, and it is better to use existing solutions when possible. Storage requirements for genotypes are modest, although full-sequence data will require greater storage capacity. Storage requirements for intermediate and results files for genetic evaluations are much greater, particularly when multiple runs must be stored for research and validation studies. The greatest gains in accuracy from genomic selection have been realized for traits of low heritability, and there is increasing interest in new health and management traits. The collection of sufficient phenotypes to produce accurate evaluations may take many years, and high-reliability proofs for older bulls are needed to estimate marker effects. Data mining algorithms applied to large data sets may help identify unexpected relationships in the data, and improved visualization tools will provide insights. Genomic selection using large data requires a lot of computing power, particularly when large fractions of the population are genotyped. Theoretical improvements have made possible the inversion of large numerator relationship matrices, permitted the solving of large systems of equations, and produced fast algorithms for variance component estimation. Recent work shows that single-step approaches combining BLUP with a genomic relationship (G) matrix have similar computational requirements to traditional BLUP, and the limiting factor is the construction and inversion of G for many genotypes. 
A naïve algorithm for creating G for 14,000 individuals required almost 24 h to run, but custom libraries and parallel computing reduced that to 15 min. Large data sets also create challenges for the delivery of genetic evaluations that must be overcome in a way that does not disrupt the transition from conventional to genomic evaluations. Processing time is important, especially as real-time systems for on-farm decisions are developed. The ultimate value of these systems is to decrease time-to-results in research, increase accuracy in genomic evaluations, and accelerate rates of genetic improvement.
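To illustrate why the construction of G dominates the cost, the following is a minimal sketch of one common way to build a genomic relationship matrix. It assumes the formulation G = ZZ' / (2 Σ p(1 − p)), where Z holds allele counts centered by twice the allele frequency; the abstract does not say which formulation or library was used, so this choice is an assumption. The function name `genomic_relationship_matrix` and the 0/1/2 genotype coding are likewise assumptions made for illustration. Writing the computation as a single matrix product lets NumPy hand the work to an optimized linear algebra library rather than looping over pairs of individuals.

```python
import numpy as np


def genomic_relationship_matrix(genotypes):
    """Build G from an (n_individuals, n_markers) array of 0/1/2 allele counts.

    Illustrative sketch only: allele frequencies are estimated from the
    sample itself, and missing genotypes are not handled.
    """
    M = np.asarray(genotypes, dtype=np.float64)

    # Allele frequency per marker, estimated from the genotyped animals.
    p = M.mean(axis=0) / 2.0

    # Center each marker column by its expected allele count, 2p.
    Z = M - 2.0 * p

    # Scaling factor: twice the summed heterozygosity across markers.
    denom = 2.0 * np.sum(p * (1.0 - p))
    if denom == 0.0:
        raise ValueError("all markers are monomorphic; G is undefined")

    # One n x n matrix product replaces the naive pairwise double loop.
    return (Z @ Z.T) / denom


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical toy data: 6 animals, 50 markers.
    toy_genotypes = rng.integers(0, 3, size=(6, 50))
    G = genomic_relationship_matrix(toy_genotypes)
    print(G.shape)
```

The work grows with the square of the number of genotyped animals times the number of markers, which is consistent with the abstract's point that building and inverting G becomes the limiting step as more of the population is genotyped.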