Liu Zhanghui, Zhang Yudong, Chen Yuzhong, Fan Xinwen, Dong Chen
Fujian Key Laboratory of Network Computing and Intelligent Information Processing, College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, China.
Key Laboratory of Spatial Data Mining & Information Sharing, Ministry of Education, Fuzhou 350116, China.
Entropy (Basel). 2020 Sep 22;22(9):1058. doi: 10.3390/e22091058.
Domain generation algorithms (DGAs) use specific parameters as random seeds to generate large numbers of random domain names and thereby evade detection of malicious domain names. This greatly increases the difficulty of detecting and defending against botnets and malware. Traditional models for detecting algorithmically generated domain names generally rely on manually extracting statistical characteristics from the domain names or network traffic and then employing classifiers to distinguish the algorithmically generated domain names. These models require labor-intensive manual feature engineering. In contrast, most state-of-the-art models based on deep neural networks are sensitive to imbalance in the sample distribution and cannot fully exploit the discriminative class features in domain names or network traffic, which reduces detection accuracy. To address these issues, we employ the borderline synthetic minority over-sampling technique (Borderline-SMOTE) to improve sample balance. We also propose a recurrent convolutional neural network with spatial pyramid pooling (RCNN-SPP) to extract discriminative and distinctive class features. The recurrent convolutional neural network combines a convolutional neural network (CNN) and a bi-directional long short-term memory network (Bi-LSTM) to extract both semantic and contextual information from domain names. We then employ the spatial pyramid pooling strategy to refine the contextual representation by capturing multi-scale contextual information from domain names. Experimental results on different domain name datasets show that our model achieves 92.36% accuracy, 89.55% recall, a 90.46% F1-score, and 95.39% AUC in distinguishing DGA from legitimate domain names, and 92.45% accuracy, 90.12% recall, a 90.86% F1-score, and 96.59% AUC in multi-class classification. It achieves a significant improvement over existing models in terms of accuracy and robustness.
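The spatial pyramid pooling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the pyramid levels, the pooling operator, or the feature dimensions, so the function name `spatial_pyramid_pool_1d`, the levels `(1, 2, 4)`, and the use of max pooling are all assumptions chosen for illustration. The sketch only shows the general idea that per-character feature vectors of a variable-length domain name are pooled at several scales into one fixed-length vector.

```python
from typing import List, Sequence


def spatial_pyramid_pool_1d(
    features: Sequence[Sequence[float]],
    levels: Sequence[int] = (1, 2, 4),  # assumed pyramid levels, not from the paper
) -> List[float]:
    """Max-pool a (time steps x channels) sequence into a fixed-length vector.

    For each pyramid level with `bins` bins, the sequence is split into
    `bins` contiguous windows and each channel is max-pooled per window.
    The output length is channels * sum(levels), independent of the
    number of time steps.
    """
    if not features:
        raise ValueError("features must contain at least one time step")
    length = len(features)
    channels = len(features[0])
    pooled: List[float] = []
    for bins in levels:
        for b in range(bins):
            # Floor for the start and ceiling for the end, in integer
            # arithmetic, so the windows cover the whole sequence.
            start = (b * length) // bins
            end = -(-((b + 1) * length) // bins)
            # Guarantee a non-empty window when length < bins.
            end = max(end, start + 1)
            window = features[start:end]
            for c in range(channels):
                pooled.append(max(step[c] for step in window))
    return pooled


# Hypothetical feature maps for a short and a long domain name (2 channels).
short_name = [[1.0, 0.0], [0.5, 2.0], [3.0, 1.0]]
long_name = [[float(i % 4), float(i)] for i in range(10)]

# Both inputs map to vectors of the same length: 2 channels * (1 + 2 + 4) bins.
print(len(spatial_pyramid_pool_1d(short_name)))
print(len(spatial_pyramid_pool_1d(long_name)))
```

Because the output size depends only on the number of channels and the pyramid levels, domain names of different lengths can feed the same classifier layer, while the coarser and finer bins keep context at more than one scale.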