Division of Research Informatics, Center for Informatics, City of Hope National Medical Center, Duarte, CA 91010, USA.
J Biosci. 2022;47.
Synthetic data is playing an increasingly prominent role in data and machine learning workflows, where it is used to build better models and to conduct analyses that support stronger statistical inference. In healthcare and biomedical research, synthetic data appears in both structured and unstructured formats. Alongside the adoption of synthetic data, deep learning, a sub-discipline of machine learning, has seen rapid and widespread uptake. At larger scale, deep learning methods tend to outperform traditional methods in regression and classification tasks. These techniques are also used in generative modeling and are thus prime candidates for generating synthetic data in both structured and unstructured formats. Here, we emphasize the generation of synthetic data in healthcare and biomedical research using deep learning methods for unstructured data formats such as text and images. Deep learning methods are built on neural networks, and in the context of generative modeling, several neural network architectures can create new synthetic data for a problem at hand, including, but not limited to, recurrent neural networks (RNNs), variational autoencoders (VAEs), and generative adversarial networks (GANs). To better understand these methods, we look at specific case studies such as generating realistic clinical notes for a patient, generating synthetic DNA sequences, and enriching experimental data collected during the study of heterotypic cultures of cancer cells.