Michels James, Bandarupalli Ramya, Akbari Amin Ahangar, Le Thai, Xiao Hong, Li Jing, Hom Erik F Y
Department of Computer Science, University of Mississippi, University, MS.
Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, MS.
ArXiv. 2024 Oct 17:arXiv:2409.13057v2.
Natural Language Processing (NLP) has revolutionized the way computers are used to study and interact with human languages and is increasingly influential in the study of protein-ligand binding, which is critical for drug discovery and development. This review examines how NLP techniques have been adapted to decode the "language" of proteins and small molecule ligands to predict protein-ligand interactions (PLIs). We discuss how methods such as long short-term memory (LSTM) networks, transformers, and attention mechanisms can leverage different protein and ligand data types to identify potential interaction patterns. We highlight significant challenges, including the scarcity of high-quality negative data, difficulties in interpreting model decisions, and sampling biases in existing datasets. We argue that focusing on improving data quality, enhancing model robustness, and fostering both collaboration and competition could catalyze future advances in machine-learning-based predictions of PLIs.