Binding Activity Classification of Anti-SARS-CoV-2 Molecules using Deep Learning Across Multiple Assays.

Author information

Yamasan Bilge Eren, Korkmaz Selçuk

Affiliations

Department of Biophysics, Trakya University Faculty of Medicine, Edirne, Türkiye

Department of Biostatistics and Medical Informatics, Trakya University Faculty of Medicine, Edirne, Türkiye

Publication information

Balkan Med J. 2024 May 3;41(3):186-192. doi: 10.4274/balkanmedj.galenos.2024.2024-1-73. Epub 2024 Mar 11.

Abstract

BACKGROUND

The coronavirus disease-2019 (COVID-19) pandemic, caused by severe acute respiratory syndrome-coronavirus-2 (SARS-CoV-2), created an urgent need for effective therapeutics, and in particular for the rapid identification and classification of potential small-molecule drugs. Because traditional methods are labor-intensive and time-consuming, deep learning has emerged as an essential tool for efficiently processing complex biological data and extracting insights from it.

AIMS

To apply deep learning, specifically deep neural networks (DNN) combined with the synthetic minority oversampling technique (SMOTE), to improve the classification of the binding activity of anti-SARS-CoV-2 molecules across multiple bioassays.

METHODS

We used 11 bioassay datasets covering a range of SARS-CoV-2 interactions and inhibitory mechanisms, from spike-ACE2 protein-protein interaction to ACE2 enzymatic activity and 3CL enzymatic activity. To address the class imbalance prevalent in these datasets, SMOTE was used to generate new samples for the minority class. For model building, we split the data into 80% training and 20% test sets and reserved 10% of the training set for validation. The DNN used ReLU and sigmoid activation functions, batch normalization, and Adam optimization. Its architecture and hyperparameters were tuned by testing different numbers of layers, minibatch sizes, numbers of epochs, and learning rates. A 40% dropout rate was applied to mitigate overfitting. For model evaluation, we computed performance metrics including balanced accuracy (BACC), precision, recall, F1 score, Matthews’ correlation coefficient (MCC), and area under the curve (AUC).
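The oversampling and splitting steps described above can be illustrated with a minimal sketch. This is not the authors' code: it is a simplified, standard-library-only version of SMOTE-style interpolation, and the function names `smote_oversample` and `split_indices`, the neighbour count `k`, the seeds, and the toy data are all assumptions made for illustration. The abstract also does not state whether oversampling was applied before or after the split, so the two helpers are shown independently.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of `base`, excluding itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def split_indices(n, test_frac=0.2, val_frac=0.1, seed=0):
    """Shuffle 0..n-1, hold out test_frac for testing, then reserve
    val_frac of the remaining training portion for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    n_val = round(len(train) * val_frac)
    return train[n_val:], train[:n_val], test


# Toy minority class: the four corners of the unit square (hypothetical data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(minority, n_new=6, k=2)
train, val, test = split_indices(1000)
print(len(new_points), len(train), len(val), len(test))  # prints: 6 720 80 200
```

Because each synthetic point lies on a line segment between two existing minority samples, it stays inside the region already occupied by that class. A common precaution, not confirmed by the abstract, is to oversample only the training portion so that synthetic samples do not leak into validation or test data.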

RESULTS

The performance of the DNN on the 11 bioassay test sets varied and was strongly influenced by the ratio of active to inactive compounds. Assays with more balanced ratios, such as AlphaLISA (1:3) and CoV-PPE (1:1), showed robust performance across metrics including BACC, precision, recall, and AUC, suggesting effective identification of active compounds in both cases. In contrast, assays with greater imbalance, such as 3CL (1:38) and cytopathic effect (1:15), showed higher recall but lower precision, highlighting the difficulty of accurately identifying active compounds among numerous inactive ones. Even in these challenging settings, the model achieved favorable BACC and recall scores. Overall, the DNN performed well as measured by BACC, MCC, and AUC, especially considering the degree of imbalance in each assay.
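The pattern of stable recall and BACC alongside falling precision follows directly from how these metrics are defined. The sketch below is illustrative only: it assumes a hypothetical classifier with a fixed recall of 0.80 and a fixed false-positive rate of 0.10 (values chosen for the example, not taken from the study) and applies it to the four active-to-inactive ratios mentioned above. The helper names `metrics_from_counts` and `expected_counts` are likewise assumptions.

```python
import math


def metrics_from_counts(tp, fp, tn, fn):
    """Compute recall, precision, F1, balanced accuracy, and MCC
    from the four confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    bacc = (recall + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1,
            "bacc": bacc, "mcc": mcc}


def expected_counts(n_active, inactive_per_active, recall=0.80, fpr=0.10):
    """Expected confusion counts for a hypothetical classifier whose recall
    and false-positive rate do not depend on the class ratio."""
    n_inactive = n_active * inactive_per_active
    tp = recall * n_active
    fn = n_active - tp
    fp = fpr * n_inactive
    tn = n_inactive - fp
    return tp, fp, tn, fn


# Active-to-inactive ratios quoted in the abstract: 1:1, 1:3, 1:15, 1:38.
for ratio in (1, 3, 15, 38):
    m = metrics_from_counts(*expected_counts(100, ratio))
    print(f"1:{ratio:<2} precision={m['precision']:.3f} "
          f"recall={m['recall']:.3f} BACC={m['bacc']:.3f} MCC={m['mcc']:.3f}")
```

Under these assumptions, recall and BACC stay at 0.80 and 0.85 for every ratio, while precision drops from about 0.89 at 1:1 to about 0.17 at 1:38, because the same false-positive rate produces many more false positives when inactive compounds dominate. This is why BACC and MCC are more informative than precision alone when assays differ widely in class balance.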

CONCLUSION

This study demonstrates that deep learning, particularly DNN models combined with SMOTE, improves the identification of active compounds in bioassay datasets for COVID-19 drug discovery and outperforms traditional machine learning models. It also highlights the value of advanced computational techniques for handling class imbalance in high-throughput screening data.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d4c6/11077922/b918b98512a3/BMJ-41-186-g1.jpg
