Amino acid encoding for deep learning applications.

Affiliations

Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany.

Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA.

Publication information

BMC Bioinformatics. 2020 Jun 9;21(1):235. doi: 10.1186/s12859-020-03546-x.

Abstract

BACKGROUND

Deep learning algorithms are increasingly applied in bioinformatics because they usually outperform classical approaches, especially when larger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as continuous vectors through an embedding matrix. Recently, learning this embedding matrix directly from the data during model training to optimize the target prediction - a process called 'end-to-end learning' - has led to state-of-the-art results in many fields. Although the use of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN.
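
To make the terms above concrete, the following is a minimal sketch (not the authors' code) of one-hot encoding and of an embedding-matrix lookup for amino acids. The alphabet ordering, the example peptide, and the names `one_hot_encode` and `embed` are assumptions made for illustration. The embedding is only randomly initialized here; in end-to-end learning its values would be updated together with the rest of the model.

```python
import numpy as np

# Assumption: a fixed alphabetical ordering of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a (len(sequence), 20) matrix with a single 1 per row."""
    indices = [AA_TO_INDEX[aa] for aa in sequence]
    return np.eye(len(AMINO_ACIDS))[indices]


def embed(sequence, embedding_matrix):
    """Look up one row of the embedding matrix per residue."""
    indices = [AA_TO_INDEX[aa] for aa in sequence]
    return embedding_matrix[indices]


rng = np.random.default_rng(0)
# Hypothetical 8-dimensional embedding (the same dimension as VHSE8).
# These values are a random initialization, not trained parameters.
embedding_matrix = rng.normal(size=(len(AMINO_ACIDS), 8))

peptide = "ACDK"  # made-up example sequence
one_hot = one_hot_encode(peptide)
embedded = embed(peptide, embedding_matrix)

# A lookup is equivalent to multiplying the one-hot matrix by the embedding.
assert np.allclose(one_hot @ embedding_matrix, embedded)
```

Seen this way, a classical encoding such as VHSE8 or BLOSUM62 is a fixed choice of `embedding_matrix`, whereas end-to-end learning treats the same matrix as trainable.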

RESULTS

By using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension, even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacity. We found that the embedding dimension is a major factor controlling model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension.

CONCLUSION

Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content of an encoding scheme from the effect of merely making the amino acids distinguishable.
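
The recommended random-vector baseline can be sketched as follows. This is an illustrative assumption, not the procedure used in the study: the abstract does not say how the random vectors were drawn, so the uniform distribution, the seed, and the names `random_encoding` and `encode` are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard amino acids


def random_encoding(dim, seed=0):
    """One fixed random vector per amino acid, carrying no biochemical meaning."""
    rng = np.random.default_rng(seed)
    # Assumption: values drawn uniformly from [-1, 1).
    return rng.uniform(-1.0, 1.0, size=(len(AMINO_ACIDS), dim))


def encode(sequence, matrix):
    """Stack the encoding vectors of a sequence into a (length, dim) array."""
    return np.stack([matrix[AMINO_ACIDS.index(aa)] for aa in sequence])


# Match the dimension of the scheme under test: 8 for VHSE8,
# 20 for one-hot over the 20 standard amino acids.
baselines = {"VHSE8": random_encoding(8), "one-hot": random_encoding(20)}

for name, matrix in baselines.items():
    # Every amino acid still gets its own distinct vector.
    assert len(np.unique(matrix, axis=0)) == len(AMINO_ACIDS)
```

Training the same architecture once on the real scheme and once on its matched random baseline gives the comparison described above: any gap between the two can be attributed to the information in the scheme rather than to the amino acids simply having distinct vectors.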

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2da8/7285590/d5eb6afca03c/12859_2020_3546_Fig1_HTML.jpg
