Instituto de Ciencias e Ingeniería de la Computación (ICIC), Universidad Nacional del Sur-CONICET, San Andrés 800 - Campus Palihue, 8000, Bahía Blanca, Argentina.
Centro de Investigaciones Biológicas, Consejo Superior de Investigaciones Científicas (CSIC), Ramiro de Maeztu 9, 28040, Madrid, Spain.
Sci Rep. 2017 May 25;7(1):2403. doi: 10.1038/s41598-017-02114-3.
Quantitative structure-activity relationship (QSAR) modeling with machine learning techniques is a complex computational problem in which identifying the most informative molecular descriptors for predicting a specific target property plays a critical role. Two general approaches can be used for this task: feature selection and feature learning. This paper presents a comparative performance study of two state-of-the-art methods, one for each approach. Regression and classification models are inferred with both methods, under different experimental scenarios, for three target properties: two drug-like properties (blood-brain barrier and human intestinal absorption) and enantiomeric excess, a measure of purity for chiral substances. Beyond the comparison of feature selection and feature learning as competing approaches, their hybridization is also evaluated, motivated by previous results obtained in materials science. The experimental results show no clear winner between the two approaches, because performance depends on the characteristics of the compound databases used for modeling. Nevertheless, in several cases the accuracy of the models improved when both approaches were combined, provided that the molecular descriptor sets produced by feature selection and feature learning contained complementary information.
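To make the distinction between the two approaches concrete, the following minimal sketch contrasts them on synthetic data. It is not the pair of methods compared in the paper. As stated assumptions, a simple correlation filter stands in for feature selection, the first principal component (computed by power iteration) stands in for feature learning, and the hybrid simply concatenates both descriptor sets. All function names, the descriptor matrix `X`, and the target `y` are hypothetical.

```python
import random
from statistics import fmean, pstdev


def column(X, j):
    return [row[j] for row in X]


def standardize(X):
    """Scale every descriptor column to zero mean and unit variance."""
    cols = []
    for j in range(len(X[0])):
        c = column(X, j)
        m, s = fmean(c), pstdev(c) or 1.0  # guard against constant columns
        cols.append([(v - m) / s for v in c])
    return [list(row) for row in zip(*cols)]


def pearson(a, b):
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return cov / den if den else 0.0


def select_features(X, y, k):
    """Feature selection (assumed filter): keep the k original descriptors
    with the largest absolute correlation to the target."""
    scores = [abs(pearson(column(X, j), y)) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(ranked[:k])


def learn_feature(X, iters=100):
    """Feature learning (assumed stand-in): project each compound onto the
    first principal component, found by power iteration on X^T X."""
    d = len(X[0])
    w = [1.0 / d ** 0.5] * d
    for _ in range(iters):
        proj = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
        w_new = [sum(p * row[j] for p, row in zip(proj, X)) for j in range(d)]
        norm = sum(v * v for v in w_new) ** 0.5
        w = [v / norm for v in w_new]
    return [sum(wi * xi for wi, xi in zip(w, row)) for row in X]


def hybrid(X, y, k):
    """Hybrid: selected original descriptors plus the learned feature."""
    Xs = standardize(X)
    idx = select_features(Xs, y, k)
    learned = learn_feature(Xs)
    return idx, [[row[j] for j in idx] + [z] for row, z in zip(Xs, learned)]


# Synthetic data: 200 "compounds", 6 descriptors; only descriptors 0 and 3
# influence the target property.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y = [2.0 * r[0] - 1.5 * r[3] + rng.gauss(0, 0.1) for r in X]

idx, H = hybrid(X, y, k=2)
print("selected descriptor indices:", idx)
print("hybrid descriptor matrix shape:", (len(H), len(H[0])))
```

The hybrid matrix `H` could then be passed to any regression or classification learner. Whether it helps depends, as the abstract notes, on whether the selected and learned descriptors carry complementary information.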