Department of Bioengineering, Rice University, Houston, TX, USA.
Systems, Synthetic, and Physical Biology, Rice University, Houston, TX, USA.
Nat Commun. 2021 Jul 19;12(1):4387. doi: 10.1038/s41467-021-24497-8.
Targeted high-throughput DNA sequencing is a primary approach for genomics and molecular diagnostics, and has more recently been used as a readout for DNA information storage. Oligonucleotide probes used to enrich gene loci of interest have different hybridization kinetics, resulting in non-uniform coverage that increases sequencing cost and decreases sequencing sensitivity. Here, we present a deep learning model (DLM) for predicting next-generation sequencing (NGS) depth from DNA probe sequences. Our DLM includes a bidirectional recurrent neural network that takes as input both the DNA nucleotide identities and the calculated probability of each nucleotide being unpaired. We apply our DLM to three different NGS panels: a 39,145-plex panel for human single nucleotide polymorphisms (SNPs), a 2000-plex panel for human long non-coding RNA (lncRNA), and a 7373-plex panel targeting non-human sequences for DNA information storage. In cross-validation, our DLM predicts sequencing depth to within a factor of 3 with 93% accuracy for the SNP panel and 99% accuracy for the non-human panel. In independent testing, the DLM predicts sequencing depth for the lncRNA panel with 89% accuracy when trained on the SNP panel. The same model is also effective at predicting measured single-plex kinetic rate constants of DNA hybridization and strand displacement.
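To make the two quantities named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that "accuracy to within a factor of 3" means the fraction of probes whose predicted depth differs from the observed depth by at most 3-fold in either direction, and that each nucleotide is represented as a one-hot identity vector with its unpaired probability appended. The function names `fraction_within_factor` and `encode_probe`, the feature layout, and the example numbers are all hypothetical.

```python
BASES = "ACGT"  # assumed channel order for the one-hot encoding


def fraction_within_factor(predicted, observed, factor=3.0):
    """Fraction of probes whose predicted depth is within `factor`-fold
    of the observed depth (in either direction)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = 0
    for p, o in zip(predicted, observed):
        if p <= 0 or o <= 0:
            raise ValueError("depths must be positive")
        # Fold difference, always >= 1 regardless of direction.
        fold = max(p, o) / min(p, o)
        if fold <= factor:
            hits += 1
    return hits / len(predicted)


def encode_probe(sequence, p_unpaired):
    """Per-nucleotide features: 4 one-hot identity channels followed by
    the unpaired probability (assumed to be computed elsewhere)."""
    if len(sequence) != len(p_unpaired):
        raise ValueError("one probability is required per nucleotide")
    features = []
    for base, prob in zip(sequence.upper(), p_unpaired):
        one_hot = [1.0 if base == b else 0.0 for b in BASES]
        features.append(one_hot + [float(prob)])
    return features


# Illustrative values only; these are not data from the study.
predicted_depth = [100, 250, 40, 1000]
observed_depth = [120, 100, 200, 900]
accuracy = fraction_within_factor(predicted_depth, observed_depth)
probe_features = encode_probe("ACGT", [0.9, 0.1, 0.5, 0.2])
```

In this sketch, a per-position feature sequence such as `probe_features` is the kind of input a bidirectional recurrent network could consume, and `accuracy` is the kind of within-3-fold score the abstract reports.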