IEEE/ACM Trans Comput Biol Bioinform. 2022 Jul-Aug;19(4):1946-1955. doi: 10.1109/TCBB.2021.3073595. Epub 2022 Aug 8.
G-quadruplexes (G4s) are nucleic acid secondary structures that form within guanine-rich DNA or RNA sequences. G4 formation can affect chromatin architecture and gene regulation, and has been associated with genomic instability, genetic diseases, and cancer progression. Data produced by the G4-seq protocol provide unprecedented detail on G4 formation in the genome. Still, running the experimental protocol on a whole genome is expensive and time-consuming. A computational method to predict G4 formation in new DNA sequences or whole genomes is therefore highly desirable. Here, we present G4detector, a new method based on a convolutional neural network to predict G4s from DNA sequences. In addition to sequence information, we improved prediction accuracy by adding RNA secondary structure information. To train and test G4detector, we compiled novel high-throughput benchmarks from G4-seq measurements over the genomes of multiple species. We show that G4detector outperforms extant methods for the same task on all benchmark datasets, can detect G4s genome-wide with high accuracy, and, when trained on human data, extrapolates to various non-human species. The code and benchmarks are publicly available at github.com/OrensteinLab/G4detector.
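To illustrate the general idea of scoring a DNA sequence with a convolutional network, the following is a minimal sketch and not the G4detector architecture. It assumes a one-hot encoding over A, C, G, T, a single hand-set convolution kernel, global max pooling, and a logistic output. The names `one_hot`, `conv1d_relu`, `score`, and `GGG_KERNEL`, as well as all weights, are hypothetical and chosen only for illustration. A trained model would learn many kernels from G4-seq data, and the RNA secondary structure input mentioned in the abstract is not modeled here.

```python
import math

ALPHABET = "ACGT"  # channel order assumed for this sketch


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Bases outside the alphabet (for example N) become all-zero rows.
    """
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def conv1d_relu(encoded, kernel, bias=0.0):
    """Slide one kernel (width x 4) along the sequence and apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        activations.append(max(0.0, total))
    return activations


def score(seq, kernel, bias=0.0, out_weight=1.0, out_bias=0.0):
    """Global max pooling over positions, then a logistic output unit."""
    activations = conv1d_relu(one_hot(seq), kernel, bias)
    pooled = max(activations) if activations else 0.0
    return 1.0 / (1.0 + math.exp(-(out_weight * pooled + out_bias)))


# Hypothetical hand-set kernel that responds to a run of three guanines.
GGG_KERNEL = [[0.0, 0.0, 1.0, 0.0]] * 3

g_rich = "GGGTTAGGGTTAGGGTTAGGG"   # guanine-rich toy sequence
at_rich = "ATATATATCATATATATCATA"  # toy sequence with no guanines

print(score(g_rich, GGG_KERNEL, out_bias=-2.0))
print(score(at_rich, GGG_KERNEL, out_bias=-2.0))
```

With these toy weights the guanine-rich sequence scores higher than the sequence without guanines. This reflects only the hand-set kernel and says nothing about actual G4 formation or about G4detector's predictions.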