Wu Chenhao, Chen Lei
College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
Math Biosci Eng. 2023 Jan;20(1):383-401. doi: 10.3934/mbe.2023018. Epub 2022 Oct 9.
Drugs are an important means of treating disease. They are grouped into classes that indicate their properties and effects, and drugs in the same class generally share important features. The Kyoto Encyclopedia of Genes and Genomes (KEGG) DRUG recently reported a new classification system that divides drugs into 14 classes. Correctly identifying the class of a drug-like compound helps to roughly determine its effects on a particular type of disease; experiments can then be conducted to confirm these latent effects, accelerating the discovery of novel drugs. This study investigated that classification system and proposed, for the first time, a classification model that assigns one of its classes to any given drug. Traditional fingerprint features, which are very popular in drug-related studies, describe only the essential properties of an individual drug. Here, drugs were instead represented by novel features derived from a large drug network via Node2vec, a well-known network embedding algorithm. These features abstract the drug associations generated from the drugs' essential properties and describe each drug against the background of all other drugs. Because class sizes differed greatly, the synthetic minority over-sampling technique (SMOTE) was employed to tackle the imbalance problem, and the resulting balanced dataset was fed into a support vector machine to build the model. Ten-fold cross-validation suggested that the model performed excellently. The model was also superior to models using other drug features, including fingerprint features and features generated by another network embedding algorithm. Furthermore, it provided more balanced performance across all classes than a model built without SMOTE.
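The abstract names SMOTE as the step that balances the classes before the support vector machine is trained, but it does not give implementation details. The following is a minimal pure-Python sketch of the general SMOTE idea, not the authors' code: each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest same-class neighbours. The function name `smote_oversample`, the parameters `k` and `seed`, the choice to grow every class to the size of the largest one, and the toy feature vectors are all assumptions made for illustration.

```python
import random
from collections import Counter


def smote_oversample(X, y, k=5, seed=0):
    """Return (X, y) with SMOTE-style synthetic samples added so that
    every class reaches the size of the largest class (an assumption)."""
    rng = random.Random(seed)

    # Group the original feature vectors by class label.
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(list(features))
    target = max(len(rows) for rows in by_class.values())

    X_out = [list(row) for row in X]
    y_out = list(y)
    for label, rows in by_class.items():
        if len(rows) < 2:
            continue  # interpolation needs at least two samples in a class
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            # k nearest same-class neighbours by squared Euclidean distance.
            others = [r for r in rows if r is not base]
            others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
            neighbour = rng.choice(others[:k])
            # Synthetic point on the segment between base and neighbour.
            gap = rng.random()
            X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


# Toy data standing in for embedding features: class "A" has 4 samples, "B" has 2.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [5.0, 5.0], [5.2, 5.1]]
y = ["A", "A", "A", "A", "B", "B"]
X_bal, y_bal = smote_oversample(X, y, k=1)
print(Counter(y_bal))  # class counts after balancing
```

In a pipeline like the one described, the balanced output would then be passed to the classifier. When cross-validating, over-sampling is usually applied to the training folds only, so that synthetic samples do not leak into the test fold; whether the study did this is not stated in the abstract.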