Llinares-López Felipe, Berthet Quentin, Blondel Mathieu, Teboul Olivier, Vert Jean-Philippe
Brain Team, Google Research, Paris, France.
Nat Methods. 2023 Jan;20(1):104-111. doi: 10.1038/s41592-022-01700-2. Epub 2022 Dec 15.
Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. We show that, once trained, DEDAL improves alignment correctness on remote homologs by up to two- or threefold over existing methods and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.
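To make the idea of a differentiable alignment more concrete, the sketch below smooths a Smith-Waterman local alignment score by replacing each hard maximum with a temperature-scaled log-sum-exp, so the score varies smoothly with the substitution scores. It is an illustration only, not DEDAL's implementation: the function names, the toy match/mismatch scoring, the linear gap penalty, and the temperature parameter are all assumptions made for this example. In a learned model the substitution scores would presumably be computed from sequence embeddings rather than fixed by hand.

```python
import numpy as np


def toy_substitution_matrix(seq_a, seq_b, match=2.0, mismatch=-1.0):
    """Hypothetical toy scoring: `match` for identical residues, `mismatch` otherwise."""
    a = np.array(list(seq_a))[:, None]
    b = np.array(list(seq_b))[None, :]
    return np.where(a == b, match, mismatch)


def soft_local_alignment_score(sub, gap=1.0, temperature=1.0):
    """Smoothed Smith-Waterman score with a linear gap penalty (assumed form).

    sub[i, j] scores aligning residue i of the first sequence with residue j
    of the second. Every hard max of the classical recursion is replaced by
    a log-sum-exp, which approaches the hard max as temperature goes to 0.
    """
    def smooth_max(values):
        v = np.asarray(values, dtype=float) / temperature
        v_max = v.max()  # subtract the max for numerical stability
        return temperature * (v_max + np.log(np.exp(v - v_max).sum()))

    n, m = sub.shape
    H = np.zeros((n + 1, m + 1))  # H[i, j]: soft score of alignments ending at (i, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = smooth_max([
                0.0,                                 # start a new local alignment
                H[i - 1, j - 1] + sub[i - 1, j - 1],  # match or mismatch
                H[i - 1, j] - gap,                   # gap in the second sequence
                H[i, j - 1] - gap,                   # gap in the first sequence
            ])
    # Soft maximum over all end positions of the local alignment.
    return smooth_max(H[1:, 1:].ravel())


sub = toy_substitution_matrix("ACDE", "ACDE")
near_hard = soft_local_alignment_score(sub, temperature=1e-3)
smoothed = soft_local_alignment_score(sub, temperature=1.0)
```

At a very low temperature the score is close to the classical Smith-Waterman score for the same scoring scheme, while larger temperatures give a smoother upper bound on it. Porting the same recursion to an automatic differentiation framework would allow gradients of the score with respect to `sub` to be computed, which is the property a trainable aligner relies on.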