Department of Applied Mathematics and Computer Science, Faculty of Sciences, Ghent University, Krijgslaan 281, S9, 9000 Gent, Belgium.
BMC Bioinformatics. 2012 Nov 23;13:312. doi: 10.1186/1471-2105-13-312.
Sampling core subsets from genetic resources while maintaining as much as possible of the genetic diversity of the original collection is an important but computationally complex task for gene bank managers. The Core Hunter computer program was developed as a tool to generate such subsets based on multiple genetic measures, including both distance measures and allelic diversity indices. First, we investigate the effect of minimum (instead of the default mean) distance measures on the performance of Core Hunter. Second, we try to gain more insight into the performance of the original Core Hunter search algorithm through comparison with several other heuristics on several realistic datasets of varying size and allelic composition. Finally, we propose a new algorithm (Mixed Replica search) for Core Hunter II with the aim of improving the diversity of the constructed core sets and reducing their generation times.
Our results show that introducing minimum distance measures leads to core sets in which all accessions are sufficiently distant from each other, which was not always achieved when optimizing mean distance alone. Comparison of the original Core Hunter algorithm, Replica Exchange Monte Carlo (REMC), with simpler heuristics shows that the simpler algorithms often give very good results with lower runtimes than REMC. However, the simpler algorithms perform slightly worse than REMC under lower sampling intensities, and some heuristics clearly struggle with minimum distance measures. In comparison, the new advanced Mixed Replica search algorithm (MixRep), which uses heterogeneous replicas, was able to sample core sets with equal or higher diversity scores than REMC and the simpler heuristics, often using less computation time than REMC.
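The difference between mean and minimum distance objectives can be illustrated with a small sketch. The one-dimensional "accessions" and the function names below are hypothetical and chosen only for illustration; they are not Core Hunter's genetic distance measures or its API.

```python
from itertools import combinations


def pairwise_distances(core):
    """All pairwise absolute differences between the selected values."""
    return [abs(a - b) for a, b in combinations(core, 2)]


def mean_distance(core):
    """Average pairwise distance within the core set."""
    d = pairwise_distances(core)
    return sum(d) / len(d)


def min_distance(core):
    """Smallest pairwise distance within the core set."""
    return min(pairwise_distances(core))


# Hypothetical toy data: core_a holds two pairs of near-duplicates,
# core_b is evenly spread.
core_a = [0, 1, 100, 101]
core_b = [0, 30, 60, 90]

for name, core in (("core_a", core_a), ("core_b", core_b)):
    print(name, "mean:", mean_distance(core), "min:", min_distance(core))
```

In this toy case the mean distance favours core_a even though it contains near-duplicate pairs, while the minimum distance favours core_b, in which every pair is well separated.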
The REMC search algorithm used in the original Core Hunter computer program performs well, sometimes leading to slightly better results than some of the simpler methods, although it does not always give the best results. Switching to the new Mixed Replica algorithm can significantly improve both overall results and runtimes. Finally, we recommend including minimum distance measures in the objective function when looking for core sets in which all accessions are sufficiently distant from each other. Core Hunter II is freely available as an open source project at http://www.corehunter.org.
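For readers unfamiliar with replica exchange, the following is a minimal generic sketch of the idea applied to subset selection with a minimum distance objective. It is not the Core Hunter or MixRep implementation: the function names, the single-swap neighbourhood, the temperature ladder and the toy distance matrix are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def score(core, dist):
    """Objective to maximize: minimum pairwise distance within the core."""
    return min(dist[a][b] for a, b in combinations(sorted(core), 2))


def neighbour(core, n_items, rng):
    """Swap one selected item for one unselected item (assumed move type)."""
    out = rng.choice(sorted(core))
    inn = rng.choice([i for i in range(n_items) if i not in core])
    return (core - {out}) | {inn}


def replica_exchange(dist, core_size, temps, steps, seed=0):
    """Generic replica exchange; temps must be in ascending order."""
    rng = random.Random(seed)
    n = len(dist)
    replicas = [frozenset(rng.sample(range(n), core_size)) for _ in temps]
    best = max(replicas, key=lambda c: score(c, dist))
    for _ in range(steps):
        # Metropolis step inside each replica at its own temperature.
        for k, t in enumerate(temps):
            cand = neighbour(replicas[k], n, rng)
            delta = score(cand, dist) - score(replicas[k], dist)
            if delta >= 0 or rng.random() < math.exp(delta / t):
                replicas[k] = cand
                if score(cand, dist) > score(best, dist):
                    best = cand
        # Propose exchanging solutions between adjacent temperatures.
        for k in range(len(temps) - 1):
            gain = score(replicas[k + 1], dist) - score(replicas[k], dist)
            arg = (1.0 / temps[k] - 1.0 / temps[k + 1]) * gain
            if arg >= 0 or rng.random() < math.exp(arg):
                replicas[k], replicas[k + 1] = replicas[k + 1], replicas[k]
    return best


# Hypothetical toy data: items on a line, distance = absolute difference.
positions = [0, 1, 2, 10, 11, 20, 21, 30]
dist = [[abs(p - q) for q in positions] for p in positions]
best = replica_exchange(dist, core_size=4, temps=[0.5, 2.0, 8.0], steps=300)
print(sorted(best), score(best, dist))
```

Hot replicas accept worsening moves more readily and so explore broadly, while cold replicas refine good solutions; the exchange step lets promising solutions migrate toward the cold end. All replicas here run the same search, whereas the abstract describes MixRep as using heterogeneous replicas, which this sketch does not attempt to reproduce.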