Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction.

Affiliations

B2SLab, Department d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028, Barcelona, Spain.

Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Dèu, 08950, Esplugues de Llobregat, Spain.

Publication information

Sci Rep. 2020 Sep 3;10(1):14634. doi: 10.1038/s41598-020-71450-8.

Abstract

The use of raw amino acid sequences as input to deep learning models for protein functional prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, whereas deep learning models require inputs of the same shape. To accomplish this, zeros are usually added to each sequence up to an established common length, a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is not yet known. We propose and implement four novel ways of padding amino acid sequences. We then analyse the impact of the different padding strategies on a hierarchical Enzyme Commission number prediction problem. Results show that padding affects model performance even when convolutional layers are involved. In contrast to most deep learning studies, which focus mainly on architectures, this study highlights the relevance of padding, a process often deemed of low importance, and raises awareness of the need to refine it for better performance. The code for this analysis is publicly available at https://github.com/b2slab/padding_benchmark .
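To make the zero-padding step concrete, the following minimal sketch pads integer-encoded sequences to a common length. It is an illustration only and is not taken from the padding_benchmark repository: the names `AA_TO_INDEX`, `encode` and `pad_sequence`, the 20-letter alphabet, and the choice of 0 as the padding value are all assumptions. The "post" and "pre" modes show the two most common conventions (zeros after or before the sequence); they are not claimed to match the four padding types proposed in the paper.

```python
# Hypothetical illustration of zero-padding; not the authors' implementation.

# Assumed alphabet: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Index 0 is reserved for padding, so residues map to 1..20.
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
PAD = 0


def encode(sequence):
    """Map each residue to its integer index (raises KeyError on unknown letters)."""
    return [AA_TO_INDEX[aa] for aa in sequence]


def pad_sequence(encoded, max_len, mode="post"):
    """Pad an encoded sequence with zeros up to max_len.

    mode="post" appends the zeros after the sequence,
    mode="pre" places them before it.
    """
    if len(encoded) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - len(encoded)
    if mode == "post":
        return encoded + [PAD] * n_pad
    if mode == "pre":
        return [PAD] * n_pad + encoded
    raise ValueError("mode must be 'post' or 'pre'")


# Two toy sequences of different lengths become rows of equal length.
batch = [pad_sequence(encode(s), max_len=8, mode="post") for s in ["MKV", "ACDEFG"]]
for row in batch:
    print(row)
```

Both rows in `batch` have length 8, so they can be stacked into a single same-shape input. Switching `mode` changes only where the zeros sit, which is the kind of choice whose effect on model performance the study examines.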

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd8e/7471694/c22958bafd17/41598_2020_71450_Fig1_HTML.jpg
