Sharma Rahul, Saghapour Ehsan, Chen Jake Y
Informatics Institute, School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA.
iScience. 2024 Feb 8;27(3):109127. doi: 10.1016/j.isci.2024.109127. eCollection 2024 Mar 15.
Natural language processing (NLP) is a well-established field in machine learning (ML) for developing language models that capture the sequence of words in a sentence. Drug molecule structures can likewise be represented as sequences using SMILES notation. Unlike natural language text, however, the special characters in drug SMILES carry specific meanings and cannot be ignored. We introduce a novel NLP-based method that uses N-grams to extract interpretable sequences and essential features from drug SMILES notation. We compare these features with Morgan fingerprint bit-vectors using UMAP-based embedding, and we validate the method's effectiveness through two personalized drug screening (PSD) case studies. Our NLP-based features are sparse and, when combined with gene expression and disease phenotype features, produce better ML models for PSD. This approach provides a new way to analyze drug molecule structures represented in SMILES notation, which can help accelerate drug discovery efforts. We have also made our method accessible through a Python library.
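To illustrate the general idea of N-gram features over SMILES strings, the following is a minimal sketch and not the authors' library or algorithm. The tokenizer regex, the function names (`tokenize`, `ngrams`, `ngram_counts`, `featurize`), and the choice of N values are all assumptions made for illustration. The sketch keeps special characters such as `(`, `=`, and bracketed atoms as tokens rather than discarding them, which is the point the abstract makes about SMILES differing from natural language text.

```python
import re
from collections import Counter

# Assumed tokenizer (not from the paper): bracketed atoms, two-letter
# halogens, and two-digit ring closures are kept as single tokens;
# every other character, including special symbols, is its own token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens without dropping special characters."""
    return SMILES_TOKEN.findall(smiles)


def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, each joined into one string."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(smiles, n_values=(1, 2, 3)):
    """Count the N-grams of one SMILES string for each N in n_values."""
    tokens = tokenize(smiles)
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(tokens, n))
    return counts


def featurize(smiles_list, n_values=(1, 2, 3)):
    """Build a shared N-gram vocabulary and a count matrix (one row per molecule)."""
    per_molecule = [ngram_counts(s, n_values) for s in smiles_list]
    vocab = sorted(set().union(*per_molecule))
    matrix = [[c.get(gram, 0) for gram in vocab] for c in per_molecule]
    return vocab, matrix


if __name__ == "__main__":
    # Ethanol and acetyl chloride, used here only as small example inputs.
    vocab, matrix = featurize(["CCO", "CC(=O)Cl"], n_values=(2,))
    for gram, column in zip(vocab, zip(*matrix)):
        print(gram, column)
```

Because each column corresponds to a readable substring of the SMILES, a feature can be traced back to a fragment of the notation, and most entries are zero for any given molecule, which is consistent with the sparsity the abstract describes. How the published method selects or filters N-grams is not stated in this abstract and is not modeled here.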