Lopez-Perez Kenneth, Zhao Bill, Miranda-Quintana Ramón Alain
Department of Chemistry Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States.
J Chem Inf Model. 2025 Jul 14;65(13):6797-6808. doi: 10.1021/acs.jcim.5c00894. Epub 2025 Jun 17.
The mean and variance of the pairwise molecular similarities in a set are valuable for cheminformatics tasks such as chemical space exploration and subset selection. However, computing the variance of the complete similarity matrix has quadratic complexity, $O(N^2)$. As molecular libraries keep growing, this pairwise approach becomes infeasible. In this work, we present an approach to calculate the exact standard deviation of molecular similarities in a set (with $N$ molecules and $M$ features) for the Russell-Rao (RR) and Sokal-Michener (SM) similarity indexes in $O(NM^2)$ complexity. Furthermore, we present a highly accurate linear-complexity approximation, $O(N)$, based on sampling representative molecules from the set. The proposed approximation can be extended to other similarity indices, including the popular Jaccard-Tanimoto (JT). By sampling only 50 molecules, the proposed method estimates the standard deviation of similarities with an RMSE lower than 0.01 for sets of up to 50,000 molecules. In comparison, our results show that random sampling does not guarantee a good approximation with the same number of selected molecules.
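To illustrate how an exact standard deviation can be obtained without forming the N x N similarity matrix, here is a minimal sketch for the Russell-Rao index. It is not taken from the paper and may differ from the authors' formulation. It assumes binary fingerprints stored as an N x M NumPy array, takes the RR similarity of molecules i and j to be (x_i . x_j) / M, and computes statistics over the N(N-1)/2 distinct pairs. The function names `rr_similarity_stats` and `rr_similarity_stats_pairwise` are hypothetical. The sum of squared similarities is recovered from the M x M feature Gram matrix X^T X, which costs on the order of N*M^2 operations.

```python
import numpy as np


def rr_similarity_stats(X):
    """Mean and std of pairwise Russell-Rao similarities without the N x N matrix.

    X is an (N, M) binary fingerprint array (assumed layout).
    RR similarity is taken as s_ij = (x_i . x_j) / M, and the statistics
    run over the N(N-1)/2 distinct pairs i < j.
    """
    X = np.asarray(X, dtype=np.float64)
    n, m = X.shape
    if n < 2:
        raise ValueError("need at least two molecules")

    col = X.sum(axis=0)   # c_k: number of molecules with bit k on
    row = X.sum(axis=1)   # r_i: on-bits in molecule i (= x_i . x_i for binary data)
    gram = X.T @ X        # M x M feature co-occurrence counts, ~N*M^2 work

    n_pairs = n * (n - 1) / 2.0
    # sum_{i,j} x_i . x_j = sum_k c_k^2; remove the diagonal, then halve.
    sum_s = (col @ col - row.sum()) / (2.0 * m)
    # sum_{i,j} (x_i . x_j)^2 = ||X^T X||_F^2; remove the diagonal, then halve.
    sum_s2 = ((gram ** 2).sum() - (row ** 2).sum()) / (2.0 * m ** 2)

    mean = sum_s / n_pairs
    var = max(sum_s2 / n_pairs - mean ** 2, 0.0)  # clamp tiny negative round-off
    return mean, np.sqrt(var)


def rr_similarity_stats_pairwise(X):
    """Quadratic-cost reference: build the full similarity matrix explicitly."""
    X = np.asarray(X, dtype=np.float64)
    n, m = X.shape
    sims = (X @ X.T) / m
    vals = sims[np.triu_indices(n, k=1)]  # distinct pairs only
    return vals.mean(), vals.std()        # population std (ddof=0)
```

On small inputs the two functions should agree to floating-point precision, which gives a quick sanity check before applying the first one to a large library. The variance is computed as E[s^2] - E[s]^2, so for very tight similarity distributions some precision may be lost to cancellation.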