Department of Biochemistry and Molecular Biology, Miller School of Medicine, University of Miami, Miami, FL 33136, USA.
Department of Bioscience and Technology for Food, Agriculture and Environment, University of Teramo, 64100 Teramo, Italy.
Int J Mol Sci. 2023 Jul 29;24(15):12144. doi: 10.3390/ijms241512144.
This research introduces a novel pipeline that couples machine learning (ML) with molecular docking to accelerate small peptide ligand screening by predicting peptide-protein docking performance. Eight ML algorithms were evaluated. Notably, Light Gradient Boosting Machine (LightGBM) achieved an F1-score and accuracy comparable to its counterparts while showing superior computational efficiency. LightGBM was then used to classify the peptide-protein docking performance of the entire tetrapeptide library of 160,000 peptide ligands against four viral envelope proteins. The library was classified into two groups, 'better performers' and 'worse performers'. By training the LightGBM algorithm on just 1% of the tetrapeptide library, we classified the remaining 99% with an accuracy of 0.81-0.85 and an F1-score of 0.58-0.67. Three different molecular docking software packages were used to show that the process is not software dependent. With an adjustable probability threshold (from 0.5 to 0.95), the process could be accelerated at least 10-fold while still reaching 90-95% concurrence with the method without ML. This study validates the efficiency of machine learning coupled to molecular docking in rapidly identifying top peptides without relying on high-performance computing power, making it an effective tool for screening potential bioactive compounds.
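The library and training-set sizes quoted above follow from simple counting, which the sketch below makes explicit. It enumerates all tetrapeptides over the 20 standard amino acids (20^4 = 160,000), sets aside 1% for docking and training, and estimates how many docking runs are saved when only classifier-flagged peptides are docked afterwards. The random selection of the training subset, the helper names (`tetrapeptide_library`, `split_library`, `docking_speedup`), the cost model, and the 5% flagged fraction are all illustrative assumptions, not details taken from the study.

```python
import itertools
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def tetrapeptide_library():
    """Enumerate every tetrapeptide over the 20 standard amino acids."""
    return ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=4)]


def split_library(library, train_fraction=0.01, seed=0):
    """Pick a random training subset to dock; the rest is left for the classifier.

    Random sampling is an assumption; the abstract does not say how the 1% is chosen.
    """
    rng = random.Random(seed)
    n_train = round(len(library) * train_fraction)
    train = set(rng.sample(library, n_train))
    rest = [p for p in library if p not in train]
    return sorted(train), rest


def docking_speedup(n_total, n_train, n_flagged):
    """Docking runs without ML divided by docking runs with ML.

    Assumed cost model: with ML, only the training peptides and the peptides
    flagged as 'better performers' are docked; classifier cost is ignored.
    """
    return n_total / (n_train + n_flagged)


library = tetrapeptide_library()
train, rest = split_library(library)
print(len(library), len(train), len(rest))  # 160000 1600 158400

# Hypothetical scenario: the classifier flags 5% of the remaining peptides.
n_flagged = round(0.05 * len(rest))
print(round(docking_speedup(len(library), len(train), n_flagged), 1))
```

Under this cost model, raising the probability threshold shrinks the flagged set and so raises the speed-up, at the price of missing more true top performers. That is the trade-off the adjustable threshold controls.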