Zahoránszky-Kőhalmi Gergely, Bologa Cristian G, Oprea Tudor I
Translational Informatics Division, University of New Mexico School of Medicine, MSC09 5025, Albuquerque, NM 87131 USA.
J Cheminform. 2016 Mar 30;8:16. doi: 10.1186/s13321-016-0127-5. eCollection 2016.
Methods based on complex network theory and the emergence of "Big Data" have reshaped how structure-activity relationships of molecules are investigated. This shift has given rise to new methods that face an important challenge: how to restructure a large molecular dataset into a network that best serves the subsequent analyses. Focusing on network clustering, our study addresses this open question by proposing a data transformation method and a clustering framework.
Using the WOMBAT and PubChem MLSMR datasets, we investigated how varying the similarity threshold applied to the similarity matrix affects the average clustering coefficient of the resulting similarity-based networks. These similarity networks were then clustered with the InfoMap algorithm. We devised a systematic method to generate so-called "pseudo-reference" clustering datasets, which compensate for the lack of large-scale reference datasets. With the clustering framework, we were able to observe how varying the similarity threshold affects both the average clustering coefficient and the clustering performance.
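The thresholding step and the average clustering coefficient can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy similarity matrix, and the choice of `>=` as the edge criterion are assumptions made for illustration, and nodes with fewer than two neighbors are assumed to contribute a coefficient of 0.

```python
from itertools import combinations


def threshold_network(sim, threshold):
    """Return adjacency sets: i and j are linked when sim[i][j] >= threshold.

    The >= comparison is an assumption; the source does not state whether
    the threshold is inclusive.
    """
    n = len(sim)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if sim[i][j] >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient over all nodes.

    Nodes with degree < 2 are counted with a coefficient of 0 (assumption).
    """
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count the edges that exist among this node's neighbors.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0


# Toy symmetric similarity matrix for four hypothetical molecules.
SIM = [
    [1.0, 0.9, 0.8, 0.2],
    [0.9, 1.0, 0.7, 0.3],
    [0.8, 0.7, 1.0, 0.6],
    [0.2, 0.3, 0.6, 1.0],
]

network = threshold_network(SIM, 0.5)
acc = average_clustering(network)
```

At a threshold of 0.5 the toy network keeps four edges, and its average clustering coefficient is 7/12. For datasets of realistic size, a graph library would replace these pure-Python loops.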
We observed that the average clustering coefficient, as a function of the similarity threshold, exhibits a peak that spans a range of threshold values. This peak is preceded by a steep decline in the number of edges of the similarity network. The maximum of this peak aligns well with the best clustering outcome. Thus, if no reference set is available, choosing the similarity threshold associated with this peak is a near-ideal setting for the subsequent network cluster analysis. The proposed method can serve as a general approach for determining an appropriate similarity threshold when generating the similarity network of a large-scale molecular dataset.
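The threshold selection described above can be sketched as a search for the peak of the coefficient-versus-threshold curve. This is a hedged illustration, not the published procedure: `pick_peak_threshold` is a hypothetical helper, the curve values are made up, and restricting the search to interior local maxima is an assumption (a complete network, as obtained at a very low threshold, trivially has a coefficient of 1, which would otherwise mask the peak).

```python
def pick_peak_threshold(curve):
    """Return the threshold at the highest interior local maximum.

    curve: list of (threshold, average_clustering_coefficient) pairs,
    sorted by threshold. Endpoints are never selected. Returns None if
    the curve has no interior local maximum.
    """
    best = None
    # Slide a three-point window so each interior point sees both neighbors.
    for (_, prev), (t, acc), (_, nxt) in zip(curve, curve[1:], curve[2:]):
        if acc >= prev and acc >= nxt and (best is None or acc > best[1]):
            best = (t, acc)
    return None if best is None else best[0]


# Made-up curve for illustration only; these are not values from the study.
CURVE = [
    (0.1, 0.95), (0.2, 0.70), (0.3, 0.42), (0.4, 0.55),
    (0.5, 0.68), (0.6, 0.61), (0.7, 0.40), (0.8, 0.22),
]

chosen = pick_peak_threshold(CURVE)
```

On this illustrative curve the helper selects 0.5 and ignores the higher but trivial value at the dense end (0.1). Because the observed peak spans a range of thresholds, neighboring values would likely be acceptable as well.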