CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers.
Author information
Faculty of Design, Kyushu University, Fukuoka, Japan.
Graduate School of Design, Kyushu University, Fukuoka, Japan.
Publication information
BMC Bioinformatics. 2022 Sep 12;23(1):371. doi: 10.1186/s12859-022-04916-3.
BACKGROUND
Epigenetic modifications established in mammalian gametes are largely reprogrammed during early development but are partly inherited by the embryo to support its development. In this study, we examine CpG island (CGI) sequences to predict whether a mouse blastocyst CGI inherits oocyte-derived DNA methylation from the maternal genome. Recurrent neural networks (RNNs), including those based on gated recurrent units (GRUs), have recently been employed for variable-length inputs in classification and regression analyses. One advantage of this strategy is that RNNs automatically learn latent features embedded in the inputs as their model parameters are trained. However, the available CGI datasets for predicting oocyte-derived DNA methylation inheritance are not large enough to train neural networks.
RESULTS
We propose a GRU-based model called CMIC (CGI Methylation Inheritance Classifier). To augment the data, CMIC converts each CGI sequence into a sequence of variable-length k-mers, where the length k is randomly selected from the range [Formula: see text] to [Formula: see text]. This conversion is repeated N times, and the resulting sequences are used as neural network input. N is set to 1000 by default. In addition, we propose a new embedding vector generator for k-mers called splitDNA2vec, whose procedure has higher randomness than that of the previous work, dna2vec.
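The augmentation step described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `split_into_kmers` and `augment`, the bounds `k_min=3` and `k_max=8`, the toy sequence, and the handling of the final fragment (clipped to whatever bases remain) are all assumptions made for illustration, since the abstract does not give the actual range of k or these details.

```python
import random


def split_into_kmers(seq, k_min, k_max, rng):
    """Cut seq left to right into consecutive, non-overlapping k-mers.

    Each length k is drawn uniformly from [k_min, k_max]. The last piece is
    clipped to the remaining bases, so it may be shorter than k_min
    (an assumption; the abstract does not say how the tail is handled).
    """
    kmers = []
    pos = 0
    while pos < len(seq):
        k = rng.randint(k_min, k_max)  # inclusive on both ends
        kmers.append(seq[pos:pos + k])
        pos += k
    return kmers


def augment(seq, n, k_min=3, k_max=8, seed=0):
    """Return n independent random splits of the same sequence.

    k_min and k_max are hypothetical placeholders, not values from the paper.
    """
    rng = random.Random(seed)
    return [split_into_kmers(seq, k_min, k_max, rng) for _ in range(n)]


# Toy sequence for illustration only (not real CGI data).
cgi = "CGGCGCGATCGCGGCGTACGCGCCGGATCGCG"
views = augment(cgi, n=5)
for view in views:
    print(" ".join(view))
```

Each of the N splits covers the same bases but tokenizes them differently, so a single CGI yields N distinct input sequences for the network.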
CONCLUSIONS
We found that CMIC can predict the inheritance of oocyte-derived DNA methylation at CGIs in the maternal genome of blastocysts with a high F-measure (0.93). We also showed that the F-measure can be improved by increasing the parameter N, that is, the number of sequences of variable-length k-mers derived from a single CGI sequence. This suggests that augmenting input data by converting a DNA sequence into N sequences of variable-length k-mers is effective. This approach can be applied to other DNA sequence classification and regression analyses, particularly those involving a small amount of data.
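For readers unfamiliar with the metric, the following sketch shows how an F-measure can be computed from binary labels. It assumes the balanced F1 form (the harmonic mean of precision and recall); the abstract does not state which variant was used. The function name `f_measure` and the label lists are made up for illustration and are not results from the study.

```python
def f_measure(y_true, y_pred, positive=1):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = methylation inherited, 0 = not inherited.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
score = f_measure(y_true, y_pred)
print(score)
```

In this toy example there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75.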