Duong Dat, Ahmad Wasi Uddin, Eskin Eleazar, Chang Kai-Wei, Li Jingyi Jessica
1 Department of Computer Science, University of California, Los Angeles, California.
2 Department of Human Genetics, University of California, Los Angeles, California.
J Comput Biol. 2019 Jan;26(1):38-52. doi: 10.1089/cmb.2018.0093. Epub 2018 Oct 31.
The gene ontology (GO) database contains GO terms that describe biological functions of genes. Previous methods for comparing GO terms have relied on the fact that GO terms are organized into a tree structure. Under this paradigm, the locations of two GO terms in the tree dictate their similarity score. In this article, we introduce two new solutions for this problem by focusing instead on the definitions of the GO terms. We apply neural network-based techniques from the natural language processing (NLP) domain. The first method does not rely on the GO tree, whereas the second depends on it indirectly. In our first approach, we compare two GO definitions by treating them as two unordered sets of words. Word similarity is estimated by a word embedding model that maps words into an N-dimensional space. In our second approach, we account for word order within a sentence. We use a sentence encoder to embed GO definitions into vectors and estimate how likely it is that one definition entails the other. We validate our methods in two ways. In the first experiment, we test the model's ability to differentiate a true protein-protein network from a randomly generated network. In the second experiment, we test the model's ability to distinguish orthologs from randomly matched genes in human, mouse, and fly. In both experiments, a hybrid of the NLP and GO tree-based methods achieves the best classification accuracy.
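The first approach described in the abstract, comparing two GO definitions as unordered sets of words through a word embedding, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the toy 3-dimensional vectors in `TOY_EMBEDDINGS`, the helper names, and the choice of a symmetric best-match average of cosine similarities as the set-to-set score. The abstract does not state which aggregation the authors use, so this should not be read as their exact method.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# A real system would load vectors from a trained word embedding model.
TOY_EMBEDDINGS = {
    "protein": (0.9, 0.1, 0.0),
    "binding": (0.7, 0.3, 0.1),
    "kinase": (0.8, 0.2, 0.1),
    "activity": (0.2, 0.9, 0.1),
    "transport": (0.1, 0.2, 0.9),
    "membrane": (0.2, 0.1, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def directed_score(source_words, target_words, embeddings):
    """Average, over source words, of the best cosine match among target words."""
    best = [
        max(cosine(embeddings[s], embeddings[t]) for t in target_words)
        for s in source_words
    ]
    return sum(best) / len(best)


def definition_similarity(definition_a, definition_b, embeddings):
    """Symmetric best-match average between two definitions.

    Each definition is reduced to an unordered set of known words, so word
    order is deliberately ignored. This aggregation is an assumed choice.
    """
    words_a = {w for w in definition_a.lower().split() if w in embeddings}
    words_b = {w for w in definition_b.lower().split() if w in embeddings}
    if not words_a or not words_b:
        return 0.0  # no shared vocabulary with the embedding model
    forward = directed_score(words_a, words_b, embeddings)
    backward = directed_score(words_b, words_a, embeddings)
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    close = definition_similarity("protein binding", "kinase activity", TOY_EMBEDDINGS)
    far = definition_similarity("protein binding", "membrane transport", TOY_EMBEDDINGS)
    print(f"close pair: {close:.3f}, distant pair: {far:.3f}")
```

Because the definitions are treated as sets, reordering the words in either definition leaves the score unchanged, which is the limitation the second, sentence-encoder approach is meant to address.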