Hazra Debapriya, Kim Mi-Ryung, Byun Yung-Cheol
Department of Computer Engineering, Jeju National University, Jeju 63243, Korea.
Veterinary Internal Medicine, Kyungpook National University, Daegu 41566, Korea.
Int J Mol Sci. 2022 Mar 28;23(7):3701. doi: 10.3390/ijms23073701.
Nucleic acids are the basic units of deoxyribonucleic acid (DNA) sequencing. Every organism has distinct DNA sequences composed of specific nucleotides, and sequencing reveals the genetic information carried by a particular DNA segment. Nucleic acid sequencing captures the evolutionary changes among organisms and has revolutionized disease diagnosis in animals. This paper proposes a generative adversarial network (GAN) model that creates synthetic nucleic acid sequences of the cat genome tuned to exhibit specific desired properties. We obtained the raw sequence data from Illumina next-generation sequencing and preprocessed them with the Cutadapt and DADA2 tools. The processed data were fed to a GAN model designed following the architecture of the Wasserstein GAN with gradient penalty (WGAN-GP). We introduced a predictor and an evaluator into the proposed GAN model to tune the synthetic sequences toward certain realistic properties. The predictor was built to extract samples that contain a promoter sequence, and the evaluator was built to filter for samples that score high on motif matching. The filtered samples were then passed to the discriminator. We evaluated the model on multiple metrics and demonstrated outputs for latent interpolation, latent complementation, and motif matching. Evaluation results showed that the proposed GAN model achieved 93.7% correlation with the original data and produced significant outcomes compared with existing models for sequence generation.
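For reference, the critic objective usually associated with WGAN-GP can be written as below. This is the standard textbook form, given here as an assumption: the abstract does not state the exact loss or the penalty weight used in the proposed model. The symbols $D$ (critic/discriminator), $P_r$ (real data distribution), $P_g$ (generator distribution), $\lambda$ (penalty weight), and $\epsilon$ (interpolation coefficient) are notation introduced for this sketch.

```latex
\begin{aligned}
L_D &= \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
     - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
     + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
       \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big], \\
\hat{x} &= \epsilon\, x + (1 - \epsilon)\, \tilde{x},
\qquad \epsilon \sim U[0, 1].
\end{aligned}
```

The last term penalizes the critic when the norm of its gradient, taken at points interpolated between real and generated samples, deviates from 1.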
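The motif-matching filter described in the abstract can be illustrated with a minimal sketch. It assumes that sequences are one-hot encoded and scored against a position weight matrix (PWM) by sliding a window and keeping the best score. The abstract does not specify how the evaluator computes its score, so the scoring rule, the toy `TATA` consensus motif, the threshold, and all function names below are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"  # column order used for one-hot encoding


def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), [ALPHABET.index(b) for b in seq]] = 1.0
    return out


def make_consensus_pwm(consensus):
    """Toy PWM (assumption): +1 for the consensus base, -1 for any other base."""
    pwm = -np.ones((len(consensus), 4))
    for i, base in enumerate(consensus):
        pwm[i, ALPHABET.index(base)] = 1.0
    return pwm


def best_motif_score(seq, pwm):
    """Slide the PWM along the sequence and return the highest window score."""
    width = pwm.shape[0]
    if len(seq) < width:
        return float("-inf")  # sequence too short to contain the motif
    x = one_hot(seq)
    return max(
        float(np.sum(x[i:i + width] * pwm))
        for i in range(len(seq) - width + 1)
    )


def filter_by_motif(seqs, pwm, threshold):
    """Keep only sequences whose best motif score reaches the threshold."""
    return [s for s in seqs if best_motif_score(s, pwm) >= threshold]


# Hypothetical motif and candidate sequences, for illustration only.
TOY_PWM = make_consensus_pwm("TATA")
candidates = ["GGTATAGG", "GGTACAGG", "GGGGCCCC"]
kept = filter_by_motif(candidates, TOY_PWM, threshold=3.0)
print(kept)
```

In this sketch only the sequence containing an exact `TATA` match passes the threshold. In a setup like the one the abstract describes, the surviving samples would be the ones forwarded to the discriminator.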