Assessing the effects of hyperparameters on knowledge graph embedding quality.

Author information

Oliver Lloyd, Yi Liu, Tom R. Gaunt

Affiliation

MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, UK.

Publication information

J Big Data. 2023;10(1):59. doi: 10.1186/s40537-023-00732-5. Epub 2023 May 6.

DOI:10.1186/s40537-023-00732-5

PMID:37168524

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10164002/

Abstract

Embedding knowledge graphs into low-dimensional spaces is a popular way to apply methods such as link prediction or node classification to these databases. The embedding process is very costly in both computational time and space. Part of the reason is hyperparameter optimisation, which involves repeatedly sampling from a large hyperparameter space, by random, guided, or brute-force selection, and testing the quality of the resulting embeddings. However, not all hyperparameters in this search space are equally important. In fact, with prior knowledge of their relative importance, some hyperparameters could be eliminated from the search altogether without significantly affecting the overall quality of the output embeddings. To this end, we ran a Sobol sensitivity analysis to evaluate the effects of tuning different hyperparameters on the variance of embedding quality. This was achieved by performing thousands of embedding trials, each time measuring the quality of the embeddings produced by a different hyperparameter configuration. We regressed embedding quality on those hyperparameter configurations and used the fitted model to generate Sobol sensitivity indices for each hyperparameter. By evaluating the correlation between Sobol indices, we find substantial variability in hyperparameter sensitivities between knowledge graphs, with differing dataset characteristics the probable cause of these inconsistencies. As an additional contribution, we identify several relations in the UMLS knowledge graph that may cause data leakage via inverse relations, and we derive and present UMLS-43, a leakage-robust variant of that graph.
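
To make the idea of Sobol indices on a fitted model concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a hypothetical surrogate function, surrogate_quality, standing in for the regression model fitted to trial results, and three illustrative hyperparameter names scaled to [0, 1]; none of these come from the paper. First-order indices use a Saltelli-style pick-freeze estimator and total-order indices use the Jansen estimator, with only the Python standard library.

```python
import random
from statistics import pvariance


def surrogate_quality(x):
    """Hypothetical stand-in for a regression model fitted to embedding trials.

    x holds three hyperparameters scaled to [0, 1]. The coefficients are
    invented for illustration: the first input dominates, the third is inert.
    """
    learning_rate, embedding_dim, negative_samples = x
    return 4.0 * learning_rate + 1.0 * embedding_dim + 0.0 * negative_samples


def sobol_indices(model, n_params, n_samples=20000, seed=0):
    """Estimate first-order and total-order Sobol indices by pick-freeze sampling."""
    rng = random.Random(seed)
    # Two independent sample matrices A and B over the unit hypercube.
    A = [[rng.random() for _ in range(n_params)] for _ in range(n_samples)]
    B = [[rng.random() for _ in range(n_params)] for _ in range(n_samples)]
    f_A = [model(row) for row in A]
    f_B = [model(row) for row in B]
    total_var = pvariance(f_A + f_B)

    first, total = [], []
    for i in range(n_params):
        # A with column i taken from B: only parameter i is resampled.
        f_AB = [model(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # Saltelli-style estimator of the variance explained by parameter i alone.
        v_i = sum(fb * (fab - fa) for fa, fb, fab in zip(f_A, f_B, f_AB)) / n_samples
        # Jansen estimator of the variance involving parameter i in any way.
        e_i = sum((fa - fab) ** 2 for fa, fab in zip(f_A, f_AB)) / (2 * n_samples)
        first.append(v_i / total_var)
        total.append(e_i / total_var)
    return first, total


names = ["learning_rate", "embedding_dim", "negative_samples"]
first_order, total_order = sobol_indices(surrogate_quality, len(names))
for name, s1, st in zip(names, first_order, total_order):
    print(f"{name}: first-order={s1:.3f}, total-order={st:.3f}")
```

For this additive toy surrogate the indices have closed forms: 16/17 for the first input, 1/17 for the second, and 0 for the third, so the estimates should land near those values. A hyperparameter with a near-zero total-order index, like the third one here, is the kind that could be dropped from a search under the reasoning in the abstract.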

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1186/s40537-023-00732-5.
