Röchner Philipp, Rothlauf Franz
Information Systems and Business Administration, Johannes Gutenberg University, Mainz 55128, Germany.
Cancer Registry Rhineland-Palatinate, Institute for Digital Health Data, Mainz 55116, Germany.
J Am Med Inform Assoc. 2025 Apr 1;32(4):724-735. doi: 10.1093/jamia/ocaf011.
Explanations help to understand why anomaly detection algorithms identify data as anomalous. This study evaluates whether robustly standardized explanation scores correctly identify the implausible variables that make cancer data anomalous.
The dataset analyzed consists of 18 587 truncated real-world cancer registry records containing 8 categorical variables describing patients diagnosed with bladder and lung tumors. We identified 800 anomalous records using an autoencoder's per-record reconstruction error, a common neural network-based anomaly detection approach. For each variable of a record, we determined a robust explanation score, which indicates how anomalous the variable is. A variable's robust explanation score is the autoencoder's per-variable reconstruction error, measured by cross-entropy and robustly standardized across records; that is, very large reconstruction errors have little influence on the standardization. To evaluate the explanation scores, medical coders identified the implausible variables of the anomalous records. We then compared the explanation scores to the medical coders' validation in a classification and a ranking setting. As baselines, we identified anomalous variables using the autoencoder's raw per-variable reconstruction error, the non-robustly standardized per-variable reconstruction error, the empirical frequency of implausible variables according to the medical coders' validation, and random selection or ranking of variables.
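For illustration, the minimal Python sketch below shows one way to compute per-variable cross-entropy reconstruction errors and robustly standardize them across records. The median/MAD estimator, the function names, and the eps constant are assumptions made for this sketch; the paper's exact robust standardization may differ.

```python
import numpy as np

def per_variable_cross_entropy(probs, onehot, eps=1e-12):
    """Cross-entropy reconstruction error for one categorical variable.

    probs  : (n_records, n_categories) reconstructed category probabilities
    onehot : (n_records, n_categories) one-hot encoded observed values
    Returns one error value per record for this variable.
    """
    return -np.sum(onehot * np.log(probs + eps), axis=1)

def robust_standardize(errors):
    """Standardize per-variable errors across records with median and MAD,
    so that a few very large reconstruction errors barely shift the scale.
    (Median/MAD is an assumed choice of robust estimator.)
    """
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    scale = 1.4826 * mad if mad > 0 else 1.0  # consistency factor for normal data
    return (errors - med) / scale
```

Stacking the robustly standardized errors of all 8 variables column-wise would then give, for each record, a vector of robust explanation scores that can be thresholded (classification) or sorted (ranking).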
When we sort the variables of a record by their robust explanation scores, on average the 2.37 highest-ranked variables contain all implausible variables. For the four baselines, on average the 2.84, 2.98, 3.27, and 4.91 highest-ranked variables are needed to contain all implausible variables.
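As an illustration of this ranking evaluation, the sketch below computes, for a single record, how many top-ranked variables are needed before every variable flagged as implausible by the medical coders is included; the function name and the toy scores are hypothetical, and averaging this cutoff over all anomalous records yields figures like the 2.37 reported above.

```python
import numpy as np

def rank_to_cover_all_implausible(scores, implausible_mask):
    """For one record: rank variables by explanation score (descending) and
    return how many top-ranked variables are needed to include every
    variable flagged as implausible."""
    order = np.argsort(-scores)                    # variable indices, highest score first
    flagged = np.flatnonzero(implausible_mask)     # coder-flagged variable indices
    ranks = np.array([np.where(order == v)[0][0] for v in flagged])
    return int(ranks.max()) + 1                    # 1-based cutoff

# Hypothetical toy record with 8 variables, two of which were flagged:
scores = np.array([0.2, 3.1, -0.5, 1.7, 0.0, 2.4, -1.2, 0.9])
implausible = np.array([0, 1, 0, 0, 0, 1, 0, 0], dtype=bool)
print(rank_to_cover_all_implausible(scores, implausible))  # -> 2
```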
We found that explanations based on robust explanation scores were better than or as good as the baseline explanations examined in the classification and ranking settings. Because cancer data coding is standardized internationally, we expect our results to generalize to other cancer types and registries. As we anticipate different magnitudes of per-variable autoencoder reconstruction errors in data from other medical registries and domains, such data may also benefit from robustly standardizing the per-variable reconstruction errors. Future work could explore methods that identify subsets of anomalous variables, addressing whether individual variables or their combinations contribute to anomalies; this direction aims to improve the interpretability and utility of anomaly detection systems.
Robust explanation scores can improve explanations for identifying implausible variables in cancer data.