García-Sosa Alfonso T
Institute of Chemistry, University of Tartu, Ravila 14a, 54011 Tartu, Estonia.
Molecules. 2021 Feb 26;26(5):1285. doi: 10.3390/molecules26051285.
Substances that can modify the androgen receptor (AR) pathway in humans and animals are entering the environment and food chain. They have a proven ability to disrupt hormonal systems, leading to toxicity and adverse effects involving reproduction, brain development, and prostate cancer, among others. State-of-the-art databases with experimental data on the effects of chemicals in humans, chimpanzees, and rats were used to build machine-learning classifiers and regressors and to evaluate them on independent sets. Different featurizations, algorithms, and protein structures lead to different results: deep neural networks (DNNs) trained on user-defined, physicochemically relevant features developed for this work outperformed graph convolutional, random forest, and large-featurization approaches. These user-provided structure-, ligand-, and statistically based features, combined with specific DNNs, gave the best results as judged by AUC (0.87), MCC (0.47), and other metrics, as well as by the interpretability and chemical meaning of the descriptors/features. In addition, the same features performed better in the DNN method than in a multivariate logistic model: validation MCC = 0.468 and training MCC = 0.868 for the present work, compared with evaluation set MCC = 0.2036 and training set MCC = 0.5364 for the multivariate logistic regression on the full, unbalanced set. Techniques of this type may improve the description and prediction of AR activity and toxicity, and thereby the assessment and design of compounds. Source code and data are available on GitHub.
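The abstract reports the Matthews correlation coefficient (MCC) alongside AUC and notes that the comparison model was fit on an unbalanced set. The following minimal sketch shows why MCC is informative in that setting. It is not the code from the study: the function names `confusion_counts` and `mcc` are hypothetical, and the labels are made-up illustrative data (10 actives, 90 inactives).

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here), e.g. when only one class is ever predicted.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Illustrative, made-up labels: 10 actives followed by 90 inactives.
y_true = [1] * 10 + [0] * 90

# A trivial classifier that calls every compound inactive.
all_inactive = [0] * 100

# A classifier that finds 5 of the 10 actives and raises 5 false alarms.
partial = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 85

for name, y_pred in [("all inactive", all_inactive), ("partial", partial)]:
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    print(f"{name}: accuracy = {accuracy:.2f}, MCC = {mcc(tp, tn, fp, fn):.3f}")
```

Both toy classifiers reach 0.90 accuracy on this set, yet only the second has a non-zero MCC, because MCC uses all four cells of the confusion matrix and gives no credit for simply predicting the majority class.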