Pattern discovery and disentanglement on relational datasets.

Affiliations

Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada.

School of Public Health and Health Systems, University of Waterloo, Waterloo, ON, Canada.

Publication information

Sci Rep. 2021 Mar 11;11(1):5688. doi: 10.1038/s41598-021-84869-4.

DOI:10.1038/s41598-021-84869-4

PMID:33707478

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC7952710/

Abstract

Machine learning has made impressive advances in many applications that resemble human cognition in discernment. However, success has been limited for relational datasets, particularly for data with low volume, imbalanced groups, and mislabeled cases, and the outputs typically lack transparency and interpretability. The difficulties arise from the subtle overlapping and entanglement of functional and statistical relations at the source level. Hence, we have developed the Pattern Discovery and Disentanglement System (PDD), which is able to discover explicit patterns from data of various sizes and with imbalanced groups, and to screen out anomalies. We present herein four case studies on biomedical datasets to substantiate the efficacy of PDD. It improves prediction accuracy and facilitates transparent interpretation of the discovered knowledge in an explicit representation framework, the PDD Knowledge Base, which links the sources, the patterns, and individual patients. Hence, PDD promises broad and ground-breaking applications in genomic and biomedical machine learning.
