LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal.
INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1900-001, Lisbon, Portugal.
BMC Bioinformatics. 2021 Jan 7;22(1):16. doi: 10.1186/s12859-020-03925-4.
Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed on real data without a known ground truth, which limits the assessments. In this context, we propose a synthetic data generator, G-Tric, that allows the creation of synthetic datasets with configurable properties and the possibility of planting triclusters. The generator is prepared to create datasets resembling real three-way data from the biomedical and social data domains, with the additional advantage of providing the ground truth (triclustering solution) as output.
G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters.
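The idea of planting a tricluster in a background distribution and returning the ground truth can be illustrated with a minimal NumPy sketch. This is not G-Tric's interface or implementation: the function name `plant_constant_tricluster`, its parameters, the uniform background, and the constant pattern are all assumptions chosen for illustration, and noise is applied to the whole dataset for simplicity.

```python
import numpy as np


def plant_constant_tricluster(shape, rows, cols, ctxs, value,
                              noise_sd=0.0, missing_frac=0.0, seed=0):
    """Hypothetical helper: build a 3-way numeric dataset and plant one
    constant tricluster in it. Returns the data and the ground truth."""
    rng = np.random.default_rng(seed)

    # Background: observations x features x contexts, uniform in [0, 1).
    data = rng.uniform(0.0, 1.0, size=shape)

    # Plant the pattern on the chosen subspace (constant value here).
    data[np.ix_(rows, cols, ctxs)] = value

    # Data quality: optional Gaussian noise over the whole dataset.
    if noise_sd > 0:
        data += rng.normal(0.0, noise_sd, size=shape)

    # Data quality: optional missing values, marked as NaN.
    if missing_frac > 0:
        data[rng.random(shape) < missing_frac] = np.nan

    truth = {"rows": sorted(rows), "cols": sorted(cols), "ctxs": sorted(ctxs)}
    return data, truth


if __name__ == "__main__":
    data, truth = plant_constant_tricluster(
        shape=(20, 10, 5), rows=[1, 2, 3], cols=[0, 4], ctxs=[2, 3],
        value=5.0, noise_sd=0.05, missing_frac=0.02,
    )
    print(data.shape, truth)
```

Returning the planted index sets alongside the data is what makes extrinsic evaluation possible later: a triclustering algorithm's output can be compared directly against `truth`.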
Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics when comparing solutions, which produces more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the evaluation of new triclustering approaches.
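As an example of an extrinsic metric that a known ground truth enables, the sketch below scores how well a set of found triclusters recovers the planted ones, using the Jaccard index over the cells each tricluster covers. The abstract does not specify which metrics are used, so this particular score, the dictionary layout (`rows`, `cols`, `ctxs`), and the function names are assumptions made for illustration.

```python
from itertools import product


def cells(tric):
    """Set of (row, col, ctx) cells covered by a tricluster."""
    return set(product(tric["rows"], tric["cols"], tric["ctxs"]))


def jaccard(a, b):
    """Jaccard index between the cell sets of two triclusters."""
    ca, cb = cells(a), cells(b)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


def recovery(planted, found):
    """Average, over planted triclusters, of the best Jaccard match
    among the found triclusters (1.0 means perfect recovery)."""
    if not planted:
        return 0.0
    best = [max((jaccard(p, f) for f in found), default=0.0) for p in planted]
    return sum(best) / len(planted)


if __name__ == "__main__":
    planted = [{"rows": [0, 1], "cols": [0, 1], "ctxs": [0]}]
    found = [{"rows": [0], "cols": [0, 1], "ctxs": [0]}]
    print(recovery(planted, found))
```

Swapping the arguments (`recovery(found, planted)`) gives the complementary view: how much of what the algorithm reported corresponds to a planted tricluster.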