Nielson Felicity F, Kay Bill, Young Stephen J, Colby Sean M, Renslow Ryan S, Metz Thomas O
Pacific Northwest National Laboratory, Biological Sciences Division, Richland, WA 99354, USA.
Pacific Northwest National Laboratory, Advanced Computing, Mathematics, and Data Division, Richland, WA 99354, USA.
Metabolites. 2023 Jan 9;13(1):105. doi: 10.3390/metabo13010105.
Computational methods for creating in silico libraries of molecular descriptors (e.g., collision cross sections) are becoming increasingly prevalent due to the limited number of authentic reference materials available for traditional library building. These so-called "reference-free metabolomics" methods require sampling sets of molecular conformers in order to produce high accuracy property predictions. Due to the computational cost of the subsequent calculations for each conformer, there is a need to sample the most relevant subset and avoid repeating calculations on conformers that are nearly identical. The goal of this study is to introduce a heuristic method of finding the most dissimilar conformers from a larger population in order to help speed up reference-free calculation methods and maintain a high property prediction accuracy. Finding the set of the items most dissimilar from each other out of a larger population becomes increasingly difficult and computationally expensive as either or the population size grows large. Because there exists a pairwise relationship between each item and all other items in the population, finding the of the most dissimilar items is different than simply sorting an array of numbers. For instance, if you have a set of the most dissimilar = 4 items, one or more of the items from = 4 might not be in the set = 5. An exact solution would have to search all possible combinations of size in the population exhaustively. We present an open-source software called similarity downselection (SDS), written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that for SDS to find the set of most dissimilar conformers, our method is not only orders of magnitude faster, but it is also more accurate than running Monte Carlo for 1,000,000 iterations, each searching for set sizes = 3-7 out of a population of 50,000. We also benchmark SDS against the exact solution for example small populations, showing that SDS produces a solution close to the exact solution in these instances. Using theoretical approaches, we also demonstrate the constraints of the greedy algorithm and its efficacy as a ratio to the exact solution.
由于传统库构建中可用的真实参考材料数量有限,用于创建分子描述符(例如碰撞截面)的计算机模拟库的计算方法正变得越来越普遍。这些所谓的“无参考代谢组学”方法需要对分子构象体进行采样集,以便产生高精度的性质预测。由于对每个构象体进行后续计算的计算成本,需要对最相关的子集进行采样,并避免对几乎相同的构象体重复计算。本研究的目标是引入一种启发式方法,从更大的群体中找到最不相似的构象体,以帮助加速无参考计算方法并保持较高的性质预测准确性。随着(n)或群体规模的增大,从更大的群体中找到彼此最不相似的(n)个项目集变得越来越困难且计算成本高昂。因为群体中每个项目与所有其他项目之间存在成对关系,找到(n)个最不相似项目的集合不同于简单地对数字数组进行排序。例如,如果有一组最不相似的(n = 4)个项目,那么(n = 5)中的一个或多个项目可能不在(n = 4)的集合中。精确的解决方案必须详尽地搜索群体中大小为(n)的所有可能组合。我们提出了一个名为相似性下选(SDS)的开源软件,用Python编写,可在GitHub上免费获取。SDS实现了一种启发式算法,用于快速找到(n)个最不相似项目的近似集合。我们将SDS与蒙特卡罗方法进行基准测试,蒙特卡罗方法试图通过重复随机采样找到精确解。我们表明,对于SDS找到(n)个最不相似的构象体集合,我们的方法不仅快几个数量级,而且比运行100万次迭代的蒙特卡罗方法更准确,每次蒙特卡罗方法在50000个群体中搜索大小为(n = 3-7)的集合。我们还将SDS与小群体示例的精确解进行基准测试,表明在这些情况下SDS产生的解接近精确解。使用理论方法,我们还证明了贪心算法的约束及其作为与精确解的比率的有效性。