Fourches Denis, Tropsha Alexander
Laboratory for Molecular Modeling, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill NC 27599, USA.
Mol Inform. 2013 Oct;32(9-10):827-42. doi: 10.1002/minf.201300076. Epub 2013 Sep 9.
In cheminformatics, compounds are represented as points in a multidimensional space of chemical descriptors. When all pairs of points found within a certain distance threshold in the original high-dimensional chemistry space are connected by distance-labeled edges, the resulting data structure can be defined as a Dataset Graph (DG). We show that, as in the conventional description of organic molecules, many graph indices can be computed for DGs as well. We demonstrate that chemical datasets can be effectively characterized and compared by computing simple graph indices such as the average vertex degree or the Randić connectivity index. This approach is used to characterize and quantify the similarity between different datasets or subsets of the same dataset (e.g., training, test, and external validation sets used in QSAR modeling). The freely available ADDAGRA program has been implemented to build and visualize DGs. The approach proposed and discussed in this report could be further explored and used in different cheminformatics applications such as dataset diversification by acquiring external compounds, dataset processing prior to QSAR modeling, or (dis)similarity modeling of multiple datasets studied in chemical genomics applications.
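The construction described in the abstract can be illustrated with a minimal Python sketch. It is not the ADDAGRA implementation: the function names, the use of Euclidean distance, the inclusive threshold, and the toy two-dimensional descriptor values are all assumptions made for illustration. The sketch connects every pair of points within the threshold by a distance-labeled edge, then computes the average vertex degree and the Randić connectivity index (the sum over edges of 1/sqrt(deg(u) * deg(v))).

```python
import math
from itertools import combinations


def build_dataset_graph(points, threshold):
    """Return {(i, j): distance} for every pair of points within `threshold`.

    Assumption: Euclidean distance in descriptor space, threshold inclusive.
    """
    edges = {}
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        if d <= threshold:
            edges[(i, j)] = d  # edge labeled with its distance
    return edges


def vertex_degrees(n_vertices, edges):
    """Count how many edges touch each vertex."""
    degrees = [0] * n_vertices
    for i, j in edges:
        degrees[i] += 1
        degrees[j] += 1
    return degrees


def average_vertex_degree(n_vertices, edges):
    """Mean degree; each edge contributes to two vertices."""
    return 2 * len(edges) / n_vertices if n_vertices else 0.0


def randic_index(n_vertices, edges):
    """Randic connectivity index: sum over edges of 1 / sqrt(deg_i * deg_j)."""
    deg = vertex_degrees(n_vertices, edges)
    return sum(1.0 / math.sqrt(deg[i] * deg[j]) for i, j in edges)


# Hypothetical toy dataset: four "compounds" in a 2D descriptor space.
compounds = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
dg = build_dataset_graph(compounds, threshold=1.5)

print(sorted(dg))                                   # edges among the first three points
print(average_vertex_degree(len(compounds), dg))    # 1.5
print(randic_index(len(compounds), dg))             # 1.5
```

In this toy case the first three points form a triangle and the fourth stays isolated, so an outlying compound shows up as a zero-degree vertex and lowers the average degree. Real descriptor sets would typically need scaling before distances are compared, which this sketch leaves out.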