Davies Simon R, Macfarlane Richard, Buchanan William J
Blockpass ID Lab, School of Computing, Edinburgh Napier University, Edinburgh EH10 5DT, UK.
Entropy (Basel). 2022 Oct 21;24(10):1503. doi: 10.3390/e24101503.
Ransomware is a malicious class of software that utilises encryption to implement an attack on system availability. The target's data remains encrypted and is held captive by the attacker until a ransom demand is met. A common approach used by many crypto-ransomware detection techniques is to monitor file system activity and attempt to identify encrypted files being written to disk, often using a file's entropy as an indicator of encryption. However, descriptions of these techniques often give little or no explanation of why a particular entropy calculation technique was selected, or why it was preferred over the alternatives. The Shannon method of entropy calculation is the most commonly used technique for identifying encrypted files in crypto-ransomware detection. Correctly encrypted data should be indistinguishable from random data, so apart from the standard mathematical entropy calculations such as Chi-Square (χ²), Shannon Entropy and Serial Correlation, the test suites used to validate the output of pseudo-random number generators should also be suited to this analysis. The hypothesis is that there is a fundamental difference between entropy methods and that the best methods may be used to better detect ransomware-encrypted files. The paper compares the accuracy of 53 distinct tests in differentiating between encrypted data and other file types. The testing is broken down into two phases: the first identifies potential candidate tests, and the second evaluates these candidates thoroughly. To ensure that the tests are sufficiently robust, the NapierOne dataset is used. This dataset contains thousands of examples of the most commonly used file types, as well as examples of files that have been encrypted by crypto-ransomware.
During the second phase of testing, 11 candidate entropy calculation techniques were tested against more than 270,000 individual files, resulting in nearly three million separate calculations. The overall accuracy of each test in differentiating between files encrypted using crypto-ransomware and other file types is then evaluated, and the tests are compared using this metric in an attempt to identify the entropy method most suited to encrypted file identification. An investigation was also undertaken to determine whether a hybrid approach, in which the results of multiple tests are combined, could improve accuracy.
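For illustration only, the three standard calculations named in the abstract (Shannon Entropy, Chi-Square and Serial Correlation) can be sketched at byte level as below. This is a minimal sketch and not the paper's implementation: the function names are hypothetical, the byte-wise formulations are an assumption about how such tests are typically applied to files, and `os.urandom` is used only as a stand-in for well-encrypted data.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def chi_square(data: bytes) -> float:
    """Chi-square statistic of byte counts against a uniform distribution."""
    if not data:
        return 0.0
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(256))


def serial_correlation(data: bytes) -> float:
    """Pearson correlation between each byte and the byte that follows it."""
    if len(data) < 2:
        return 0.0
    x, y = data[:-1], data[1:]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # Constant input: correlation is undefined, report 0.0.
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    samples = {
        "plain text": b"The quick brown fox jumps over the lazy dog. " * 1000,
        "random (stand-in for ciphertext)": os.urandom(45_000),
    }
    for label, blob in samples.items():
        print(label)
        print("  entropy (bits/byte):", round(shannon_entropy(blob), 4))
        print("  chi-square:         ", round(chi_square(blob), 2))
        print("  serial correlation: ", round(serial_correlation(blob), 4))
```

Under these assumptions, data that looks random should score close to 8 bits per byte, with a comparatively small Chi-Square statistic and a serial correlation near zero, while structured files tend to deviate on at least one of the three. Compressed formats can also score highly on entropy alone, which is one reason comparing several tests may be worthwhile.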