Department of Computer Science, University College London, London, UK.
Proteins. 2020 Apr;88(4):616-624. doi: 10.1002/prot.25842. Epub 2019 Nov 25.
In this paper, using Word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic "meaning" in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models that can be used to produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as "sentences" in which domain identifiers are tokens that may be considered "words." Using all InterPro (Finn et al. 2017) Pfam domain assignments, we observe that the embedding could be used to suggest putative GO assignments for Pfam (Finn et al. 2016) domains of unknown function.
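The "proteins as sentences" framing can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the protein accessions, the domain rows, the ordering of domains by start position, and the window size are all assumptions made for illustration. The sketch groups domain assignments into one token list per multi-domain protein and then enumerates the (center, context) pairs that a skip-gram style Word2vec model would train on.

```python
from collections import defaultdict

# Hypothetical (protein accession, Pfam ID, start position) rows.
# The accessions and domain layouts are invented for illustration only.
assignments = [
    ("P1", "PF00069", 10),
    ("P1", "PF00017", 200),
    ("P1", "PF07714", 350),
    ("P2", "PF00017", 5),
    ("P2", "PF00018", 120),
    ("P3", "PF00069", 30),
]

def build_sentences(rows):
    """Group domain hits by protein and order them along the sequence.

    Assumption: domains are ordered by start position, and single-domain
    proteins are dropped because they provide no context for a token.
    """
    by_protein = defaultdict(list)
    for accession, domain, start in rows:
        by_protein[accession].append((start, domain))
    return {
        accession: [domain for _, domain in sorted(hits)]
        for accession, hits in by_protein.items()
        if len(hits) > 1
    }

def skipgram_pairs(sentence, window=2):
    """Return (center, context) token pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

sentences = build_sentences(assignments)
pairs = [p for s in sentences.values() for p in skipgram_pairs(s, window=1)]
```

In practice the token lists in `sentences` would be passed to an existing Word2vec implementation rather than expanded into pairs by hand; the explicit pair enumeration here only makes visible which domains act as context for which.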