Skorupka Agata
SGH Warsaw School of Economics, Warsaw, Poland.
PLoS One. 2024 Dec 23;19(12):e0315849. doi: 10.1371/journal.pone.0315849. eCollection 2024.
The study examines graph-based methods for detecting anomalous activity on digital markets and proposes the most efficient way to improve the protection of market actors and reduce information asymmetry. Anomalies are defined here as both bots and fraudulent users (who may be either bots or real people). The methods are compared against each other and against state-of-the-art results from the literature, and a new algorithm is proposed. The goal is to find a method suitable for threat detection in terms of both predictive performance and computational efficiency. It should scale well and remain robust as the newest technologies advance. The article uses three publicly accessible graph-based datasets: one describing the Twitter social network (TwiBot-20) and two describing Bitcoin cryptocurrency markets (Bitcoin OTC and Bitcoin Alpha). In the Twitter dataset, an anomaly is a bot, as opposed to a human user, whereas in the Bitcoin datasets an anomaly is a user who conducted a fraudulent transaction, which may (but does not have to) imply being a bot. The study shows that graph-based data is a better-performing predictor than text data. It compares different graph algorithms for extracting feature sets for anomaly detection models and finds that methods based on nodes' statistics yield better model performance than state-of-the-art graph embeddings. They also yield a significant improvement in computational efficiency, which often means reducing run time by hours or enabling modeling on significantly larger graphs (usually not feasible with embeddings). On that basis, the article proposes its own algorithm based on graph statistics. Furthermore, using embeddings requires two engineering choices: the type of embedding and its dimension. The research examines whether some embedding types and dimensions perform significantly better than others. The answer turned out to be dataset-specific and had to be tailored case by case, which adds further engineering overhead to using embeddings (building a leaderboard over a grid of embedding instances, each of which takes hours to generate). This again speaks in favor of the proposed algorithm based on nodes' statistics, which is efficient and makes this engineering overhead redundant.
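To make the idea of features based on nodes' statistics concrete, the following is a minimal sketch and not the article's algorithm, whose details are not given in this abstract. It assumes a directed edge list of (source, target, rating) tuples with signed ratings, similar in spirit to the Bitcoin trust networks; the function name `node_statistics` and the chosen features are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def node_statistics(edges):
    """Compute simple per-node statistics from a directed, rated edge list.

    `edges` is an iterable of (source, target, rating) tuples.
    Illustrative assumption: these four features are examples only,
    not the feature set used in the article.
    """
    raw = defaultdict(lambda: {"out": 0, "in": 0, "rating_sum": 0.0, "neg_in": 0})
    for src, dst, rating in edges:
        raw[src]["out"] += 1            # ratings given by src
        raw[dst]["in"] += 1             # ratings received by dst
        raw[dst]["rating_sum"] += rating
        if rating < 0:
            raw[dst]["neg_in"] += 1     # negative ratings received by dst

    features = {}
    for node, s in raw.items():
        in_deg = s["in"]
        features[node] = {
            "in_degree": in_deg,
            "out_degree": s["out"],
            # Nodes that received no ratings get neutral defaults.
            "mean_in_rating": s["rating_sum"] / in_deg if in_deg else 0.0,
            "neg_in_share": s["neg_in"] / in_deg if in_deg else 0.0,
        }
    return features


if __name__ == "__main__":
    toy_edges = [("a", "b", 5), ("c", "b", -3), ("b", "a", 1)]
    for node, feats in sorted(node_statistics(toy_edges).items()):
        print(node, feats)
```

A single pass over the edge list is enough to build such a feature table, which is one plausible reason statistics of this kind are far cheaper to compute than learned embeddings. The resulting per-node dictionaries could then be passed to any tabular classifier.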