Kuroda Masataka
Discovery Technology Laboratories, Innovative Research Division, Mitsubishi Tanabe Pharma Corporation, 1000 Kamoshida, Aoba-ku, Yokohama, 227-0033 Japan.
J Cheminform. 2017 Jan 5;9:1. doi: 10.1186/s13321-016-0187-6. eCollection 2017.
Molecular descriptors have been widely used to predict biological activities and physicochemical properties, and to analyze chemical libraries on the basis of similarity. Although fingerprints and properties are generally used as descriptors, neither is perfect for these purposes. In some cases a fingerprint can distinguish between molecules that a property cannot, and vice versa. When the training set is especially small, constructing a good predictive model is difficult. Herein, a novel descriptor that integrates the mutually compensating characteristics of fingerprints and properties is described. The format of this descriptor is not conventional: each molecule is represented by a two-dimensional array whose length along one dimension is variable. This format cannot be used directly as input to machine learning methods. Therefore, a distance between molecules has been newly defined so that machine learning techniques can be applied. The descriptor was evaluated on classification tasks using a support vector machine, after the features of the descriptor had been optimized by a genetic algorithm.
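The abstract does not state how the distance between molecules is defined, so the following is only a minimal sketch under stated assumptions, not the author's method. It assumes each molecule is a list of equal-width numeric rows with a variable number of rows, and it substitutes a symmetric mean-of-minimum row distance as a stand-in. All names (`row_distance`, `molecule_distance`, `to_kernel`, `gamma`) and the toy data are hypothetical.

```python
import math
from typing import List, Sequence

Row = Sequence[float]
Descriptor = List[Row]  # variable number of rows, fixed number of columns


def row_distance(a: Row, b: Row) -> float:
    """Euclidean distance between two rows of equal width."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def directed_mean_min(A: Descriptor, B: Descriptor) -> float:
    """Average, over rows of A, of the distance to the closest row of B."""
    return sum(min(row_distance(a, b) for b in B) for a in A) / len(A)


def molecule_distance(A: Descriptor, B: Descriptor) -> float:
    """Symmetric stand-in distance (an assumption, not the paper's definition)."""
    return 0.5 * (directed_mean_min(A, B) + directed_mean_min(B, A))


def distance_matrix(descs: List[Descriptor]) -> List[List[float]]:
    """Pairwise distances; costs n*(n-1)/2 calls to molecule_distance."""
    n = len(descs)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = molecule_distance(descs[i], descs[j])
    return D


def to_kernel(D: List[List[float]], gamma: float = 1.0) -> List[List[float]]:
    """Map distances to similarities with exp(-gamma * d^2).

    With a distance like this one, the result is not guaranteed to be
    positive semidefinite.
    """
    return [[math.exp(-gamma * d * d) for d in row] for row in D]


# Toy descriptors with 2, 3, and 1 rows (made-up values).
mol_a = [[1.0, 0.0], [0.0, 1.0]]
mol_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mol_c = [[5.0, 5.0]]

D = distance_matrix([mol_a, mol_b, mol_c])
K = to_kernel(D, gamma=0.5)
```

A square matrix such as `K` is the kind of input accepted by support vector machine implementations that take a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Because every pair of molecules requires a nested comparison of rows, the cost of building `D` grows quickly with the number of molecules and with descriptor length.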
Because feature optimization is time-intensive, owing to the complicated calculation of distances between molecules, the optimization had to be stopped before it was completed. As a result, the new descriptor showed no remarkable improvement in classification results over the other descriptors on any evaluation set used in this work. However, it also did not yield extremely low accuracy on any set.
The novel descriptor proposed in this work can potentially be used to make highly accurate predictive models. This new concept in descriptors is expected to be useful for developing novel predictive methods with quick training and high accuracy.