Markić Ivan, Štula Maja, Zorić Marija, Stipaničev Darko
Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia.
Department of Electronics and Computing, Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia.
Entropy (Basel). 2020 Dec 28;23(1):31. doi: 10.3390/e23010031.
The string-matching paradigm is applied in every computer science and science branch in general. The existence of a plethora of string-matching algorithms makes it hard to choose the best one for any particular case. Expressing, measuring, and testing algorithm efficiency is a challenging task with many potential pitfalls. Algorithm efficiency can be measured based on the usage of different resources. In software engineering, algorithmic productivity is a property of an algorithm execution identified with the computational resources the algorithm consumes. Resource usage in algorithm execution could be determined, and for maximum efficiency, the goal is to minimize resource usage. Guided by the fact that standard measures of algorithm efficiency, such as execution time, directly depend on the number of executed actions. Without touching the problematics of computer power consumption or memory, which also depends on the algorithm type and the techniques used in algorithm development, we have developed a methodology which enables the researchers to choose an efficient algorithm for a specific domain. String searching algorithms efficiency is usually observed independently from the domain texts being searched. This research paper aims to present the idea that algorithm efficiency depends on the properties of searched string and properties of the texts being searched, accompanied by the theoretical analysis of the proposed approach. In the proposed methodology, algorithm efficiency is expressed through character comparison count metrics. The character comparison count metrics is a formal quantitative measure independent of algorithm implementation subtleties and computer platform differences. The model is developed for a particular problem domain by using appropriate domain data (patterns and texts) and provides for a specific domain the ranking of algorithms according to the patterns' entropy. The proposed approach is limited to on-line exact string-matching problems based on information entropy for a search pattern. Meticulous empirical testing depicts the methodology implementation and purports soundness of the methodology.
字符串匹配范式通常应用于每一个计算机科学及其他科学分支。大量字符串匹配算法的存在使得为任何特定情况选择最佳算法变得困难。表达、衡量和测试算法效率是一项具有许多潜在陷阱的挑战性任务。算法效率可以基于不同资源的使用情况来衡量。在软件工程中,算法生产率是算法执行的一种属性,它与算法消耗的计算资源相关。可以确定算法执行中的资源使用情况,为了实现最高效率,目标是尽量减少资源使用。由于算法效率的标准度量(如执行时间)直接取决于执行的操作数量,在不涉及计算机功耗或内存问题(这也取决于算法类型和算法开发中使用的技术)的情况下,我们开发了一种方法,使研究人员能够为特定领域选择高效算法。字符串搜索算法的效率通常是独立于被搜索的领域文本进行观察的。本研究论文旨在提出算法效率取决于被搜索字符串的属性和被搜索文本的属性这一观点,并对所提出的方法进行理论分析。在所提出的方法中,算法效率通过字符比较计数指标来表示。字符比较计数指标是一种形式上的定量度量,独立于算法实现的细微差别和计算机平台差异。通过使用适当的领域数据(模式和文本)为特定问题领域开发该模型,并根据模式的熵为特定领域提供算法排名。所提出的方法限于基于搜索模式的信息熵的在线精确字符串匹配问题。细致的实证测试描述了该方法的实现,并证明了该方法的合理性。