Université Grenoble Alpes, INRIA, CNRS, Grenoble INP, LJK, Grenoble, Isère, France.
University of Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, Lyon, Rhône, France.
Bioinformatics. 2019 Sep 15;35(18):3294-3302. doi: 10.1093/bioinformatics/btz094.
The growing number of annotated biological sequences makes it possible to learn genotype-phenotype relationships from data with increasingly high accuracy. When large quantities of labeled samples are available for training, convolutional neural networks can predict the phenotype of unannotated sequences with good accuracy. Unfortunately, their performance degrades on medium- or small-scale datasets, which calls for new data-efficient approaches.
We introduce a hybrid approach between convolutional neural networks and kernel methods to model biological sequences. Our method enjoys the ability of convolutional neural networks to learn data representations that are adapted to a specific task, while the kernel point of view yields algorithms that perform significantly better when the amount of training data is small. We illustrate these advantages for transcription factor binding prediction and protein homology detection, and we demonstrate that our model is also simple to interpret, which is crucial for discovering predictive motifs in sequences.
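To make the hybrid idea more concrete, the following is a minimal sketch, not the authors' implementation, of how a kernel can play the role of a convolutional filter on a sequence. It assumes a DNA alphabet, one-hot encoded k-mers, a Gaussian kernel with a hypothetical bandwidth `sigma`, and fixed "anchor" motifs; in a trained model such anchors would be learned from data rather than written by hand. The names `one_hot`, `kernel`, and `motif_features` are illustrative only.

```python
import math

ALPHABET = "ACGT"  # assumption: DNA sequences over a four-letter alphabet


def one_hot(kmer):
    """Flatten a k-mer into a one-hot vector of length 4 * k."""
    vec = []
    for base in kmer:
        vec.extend(1.0 if base == letter else 0.0 for letter in ALPHABET)
    return vec


def kernel(x, z, sigma=0.5):
    """Gaussian kernel between two one-hot encoded k-mers."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def motif_features(sequence, anchors, sigma=0.5):
    """Compare every k-mer of the sequence with each anchor motif,
    then keep the best match per anchor (global max pooling)."""
    k = len(anchors[0])  # assumption: all anchors share the same length k
    anchor_vecs = [one_hot(a) for a in anchors]
    kmers = [one_hot(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]
    return [max(kernel(x, z, sigma) for x in kmers) for z in anchor_vecs]


# Hypothetical anchors: one present in the sequence, one absent.
features = motif_features("TTGACGTAA", ["ACGT", "GGGG"])
```

Each output value lies in (0, 1] and measures how closely the best-matching k-mer resembles the corresponding anchor, so a large value points directly to a motif occurrence. This is one way to read the interpretability claim above: the filters live in the same space as the sequence fragments they detect.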
Source code is freely available at https://gitlab.inria.fr/dchen/CKN-seq.
Supplementary data are available at Bioinformatics online.