Explanation and prediction of clinical data with imbalanced class distribution based on pattern discovery and disentanglement.

Affiliation

Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada.

Publication information

BMC Med Inform Decis Mak. 2021 Jan 9;21(1):16. doi: 10.1186/s12911-020-01356-y.

DOI:10.1186/s12911-020-01356-y

PMID:33422088

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7796578/

Abstract

BACKGROUND

Statistical data analysis, especially advanced machine learning (ML) methods, has attracted considerable interest in clinical practice. We seek interpretability of diagnostic/prognostic results, which brings confidence to doctors, patients and their relatives in therapeutics and clinical practice. When datasets are imbalanced across diagnostic categories, ordinary ML methods may produce results overwhelmed by the majority classes, diminishing prediction accuracy. Hence, methods are needed that can produce explicit, transparent and interpretable results for decision-making, without sacrificing accuracy, even for data with imbalanced groups.
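The effect of class imbalance on a headline accuracy figure can be illustrated with a small arithmetic sketch. The label counts below (95 majority, 5 minority) are hypothetical and are not taken from the paper; the point is only that a classifier ignoring the minority class can still look accurate.

```python
# Hypothetical labels: 95 majority-class patients (0) and 5 minority-class patients (1).
y_true = [0] * 95 + [1] * 5

# A trivial classifier that always predicts the majority class.
y_pred = [0] * len(y_true)


def recall(y_true, y_pred, cls):
    """Fraction of true members of `cls` that were predicted as `cls`."""
    members = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in members) / len(members)


# Plain accuracy rewards agreeing with the majority class.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Balanced accuracy averages per-class recall, so the minority class counts equally.
balanced_accuracy = (recall(y_true, y_pred, 0) + recall(y_true, y_pred, 1)) / 2

print(accuracy)                   # 0.95
print(balanced_accuracy)          # 0.5
print(recall(y_true, y_pred, 1))  # 0.0
```

Per-class measures such as recall or balanced accuracy expose the failure that plain accuracy hides, which is why they are commonly reported for imbalanced data.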

METHODS

To interpret clinical patterns and predict patient diagnoses with high accuracy, we develop a novel method, Pattern Discovery and Disentanglement for Clinical Data Analysis (cPDD), which discovers patterns (correlated traits/indicants) and uses them to classify clinical data even when the class distribution is imbalanced. In the most general setting, a relational dataset is a large table in which each column represents an attribute (trait/indicant) and each row contains the set of attribute values (AVs) of an entity (patient). Compared to existing pattern discovery approaches, cPDD can discover a small, succinct set of statistically significant high-order patterns from clinical data for interpreting and predicting the disease class of patients, even when groups are small and rare.
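The relational-table setting described above can be sketched with a toy example. The abstract does not say which statistic cPDD uses, so the code below assumes the adjusted standardized residual of a contingency table as a generic test of association between two attribute values, with 1.96 as an assumed cutoff. The table contents, column names and the helper `adjusted_residual` are all hypothetical. It covers only pairwise associations and does not reproduce cPDD's high-order patterns or its disentanglement step.

```python
import math
from itertools import combinations

# Hypothetical toy relational table: one row per patient, one column per attribute.
columns = ["smoker", "cough", "outcome"]
rows = (
    [("yes", "yes", "high_risk")] * 4   # small, rare group
    + [("no", "no", "low_risk")] * 12
    + [("no", "yes", "low_risk")] * 2
    + [("yes", "no", "low_risk")] * 2
)


def adjusted_residual(rows, i, a, j, b):
    """Adjusted standardized residual for value `a` in column i and `b` in column j."""
    n = len(rows)
    n_a = sum(r[i] == a for r in rows)
    n_b = sum(r[j] == b for r in rows)
    n_ab = sum(r[i] == a and r[j] == b for r in rows)
    expected = n_a * n_b / n  # co-occurrence count expected under independence
    variance = expected * (1 - n_a / n) * (1 - n_b / n)
    if variance == 0:
        return 0.0  # a constant column carries no association information
    return (n_ab - expected) / math.sqrt(variance)


# Collect AV pairs that co-occur significantly more often than expected.
THRESHOLD = 1.96  # assumed cutoff, roughly a 95% confidence level
significant = []
for i, j in combinations(range(len(columns)), 2):
    for a in sorted({r[i] for r in rows}):
        for b in sorted({r[j] for r in rows}):
            z = adjusted_residual(rows, i, a, j, b)
            if z > THRESHOLD:
                significant.append(((columns[i], a), (columns[j], b), round(z, 2)))

for pattern in significant:
    print(pattern)
```

In this toy table the pair (`smoker` = `yes`, `outcome` = `high_risk`) is flagged even though only 4 of the 20 patients belong to the `high_risk` group, because the residual compares the observed count against what independence would predict rather than against the overall group size.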

RESULTS

Experiments on synthetic and thoracic clinical datasets showed that cPDD can 1) discover a smaller set of succinct significant patterns than other existing pattern discovery methods; 2) allow users to interpret succinct sets of patterns coming from uncorrelated sources, even when the groups are rare/small; and 3) achieve better prediction performance than other interpretable classification approaches.

CONCLUSIONS

In conclusion, cPDD discovers fewer patterns with more comprehensive coverage, improving the interpretability of the patterns discovered. Experimental results on synthetic data validated that cPDD discovers all patterns implanted in the data and displays them precisely and succinctly, with statistical support for interpretation and prediction, a capability that traditional ML methods lack. The success of cPDD as a novel interpretable method for the imbalanced class problem shows its great potential for clinical data analysis in the years to come.
