Li Keyi, Yang Sen, Sullivan Travis M, Burd Randall S, Marsic Ivan
Electrical and Computer Engineering Department, Rutgers University, New Brunswick, New Jersey, USA.
Waymo, Mountain View, CA, USA.
ACM Trans Knowl Discov Data. 2024 Nov;18(9). doi: 10.1145/3687464. Epub 2024 Nov 12.
Process data constructed from event logs provides valuable insights into procedural dynamics over time. Because process data contains confidential information and is intricate in nature, such datasets are challenging to collect and often cannot be shared. Consequently, research that uses process data and analytics in the process mining domain is limited. In this study, we introduced a synthetic process data generation task to address the scarcity of sharable process data. We introduced a generative adversarial network, called ProcessGAN, to generate process data with activity sequences and corresponding timestamps. ProcessGAN consists of a transformer-based network as the generator and a time-aware self-attention network as the discriminator. It can generate privacy-preserving process data from random noise. ProcessGAN considers the duration of the process and the time intervals between activities to generate realistic activity sequences with timestamps. We evaluated ProcessGAN on five real-world datasets: two public datasets and three private datasets collected in medical domains. To evaluate the synthetic data, in addition to statistical metrics, we trained a supervised model to score the synthetic processes. We also used process mining to discover workflows for synthetic medical processes and had domain experts evaluate the clinical applicability of the synthetic workflows. ProcessGAN outperformed the existing generative models in generating complex processes with valid parallel pathways. The synthetic process data generated by ProcessGAN better represented the long-range dependencies between activities, a feature relevant to complicated medical and other processes. The timestamps generated by the ProcessGAN model showed distributions similar to those of the authentic timestamps. In addition, we trained a transformer-based network to generate synthetic contexts (e.g., patient demographics) that were associated with the synthetic processes. The synthetic contexts generated by our model outperformed those of the baseline models, with distributions similar to those of the authentic contexts. We conclude that ProcessGAN can generate sharable synthetic process data indistinguishable from authentic data. Our source code is available at https://github.com/raaachli/ProcessGAN.
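As a rough illustration of the kind of process data described above (activity sequences with timestamps, inter-activity time intervals, and overall process duration), the following minimal Python sketch groups event-log rows into per-case traces. It is not taken from the ProcessGAN code base. The `(case_id, activity, timestamp)` row layout, the function name `build_traces`, the dictionary keys, and the sample activity names are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def build_traces(events):
    """Group (case_id, activity, iso_timestamp) rows into per-case traces.

    Hypothetical helper: the row layout and output keys are assumptions,
    not the preprocessing used by the ProcessGAN authors.
    """
    by_case = defaultdict(list)
    for case_id, activity, ts in events:
        by_case[case_id].append((datetime.fromisoformat(ts), activity))

    traces = {}
    for case_id, rows in by_case.items():
        # Order each case's events chronologically.
        rows.sort(key=lambda r: r[0])
        times = [t for t, _ in rows]
        activities = [a for _, a in rows]
        # Seconds elapsed since the previous activity; the first activity gets 0.
        intervals = [0.0] + [
            (later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])
        ]
        traces[case_id] = {
            "activities": activities,
            "intervals": intervals,
            # Total process duration in seconds, first event to last event.
            "duration": (times[-1] - times[0]).total_seconds(),
        }
    return traces


# Made-up event log; rows do not need to arrive in time order.
events = [
    ("case-1", "triage", "2024-01-01T10:00:00"),
    ("case-1", "discharge", "2024-01-01T10:12:30"),
    ("case-1", "exam", "2024-01-01T10:05:00"),
    ("case-2", "triage", "2024-01-01T11:00:00"),
]
traces = build_traces(events)
print(traces["case-1"])
```

Representing time as intervals rather than absolute timestamps is one common choice for sequence models, since it removes the dependence on calendar dates; whether ProcessGAN encodes time exactly this way is not stated in the abstract.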