Li Hao, Zhang ShiQi, Chen Lei, Pan Xiaoyong, Li ZhanDong, Huang Tao, Cai Yu-Dong
College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China.
Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark.
Front Genet. 2022 May 16;13:909040. doi: 10.3389/fgene.2022.909040. eCollection 2022.
Determining the biological functions of proteins is a central task in modern biology. Given the large number of proteins in some organisms, characterizing their functions one by one through traditional experiments is impractical, so quick and reliable methods for identifying protein functions are needed. The considerable accumulation of protein knowledge and recent developments in computer science offer an alternative: computational methods. Several efforts have been made in this field, and most previous methods either adopted protein sequence features or directly used the links in a protein-protein interaction (PPI) network. In this study, we proposed novel multi-label classifiers that represent proteins with new embedding features. These features were derived from functional domains via word embedding and from a PPI network via network embedding. The minimum redundancy maximum relevance method was used to assess the features and generate a ranked feature list. Incremental feature selection, with RAndom k-labELsets (RAKEL) used to build the multi-label classifiers, was then applied to this list to construct two optimal classifiers, one for each of two key measurements: accuracy and exact match. Both classifiers performed well and were superior to classifiers that used features extracted by traditional methods.
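The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that accuracy means the mean Jaccard overlap between true and predicted label sets and that exact match means the fraction of samples predicted perfectly, which are common multi-label definitions but are not spelled out here. The feature names, the `toy_evaluate` callback, and its hard-coded predictions are hypothetical stand-ins for training and cross-validating a real multi-label classifier on each feature subset.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

LabelSet = Set[str]
Evaluation = Tuple[Sequence[LabelSet], Sequence[LabelSet]]


def exact_match(true: Sequence[LabelSet], pred: Sequence[LabelSet]) -> float:
    """Fraction of samples whose predicted label set equals the true set."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def accuracy(true: Sequence[LabelSet], pred: Sequence[LabelSet]) -> float:
    """Mean Jaccard overlap |T & P| / |T | P| per sample (assumed definition)."""
    total = 0.0
    for t, p in zip(true, pred):
        union = t | p
        total += len(t & p) / len(union) if union else 1.0
    return total / len(true)


def incremental_feature_selection(
    ranked_features: List[str],
    evaluate: Callable[[List[str]], Evaluation],
    step: int = 1,
) -> Dict[str, Tuple[int, float]]:
    """Score each top-k prefix of a ranked feature list.

    Returns, for each measurement, the (k, score) pair of the best prefix.
    Ties keep the smaller k because only strict improvements replace it.
    """
    metrics = (("accuracy", accuracy), ("exact_match", exact_match))
    best: Dict[str, Tuple[int, float]] = {}
    for k in range(step, len(ranked_features) + 1, step):
        true, pred = evaluate(ranked_features[:k])
        for name, metric in metrics:
            score = metric(true, pred)
            if name not in best or score > best[name][1]:
                best[name] = (k, score)
    return best


# Hypothetical ranked list, e.g. as produced by a feature-ranking method.
ranked = ["f_domain_1", "f_ppi_1", "f_domain_2"]

# Toy data: true label sets for four proteins and canned predictions per k.
TRUE = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"}, {"d"}]
PRED_BY_K = {
    1: [{"d"}, {"d"}, {"d"}, {"d"}],
    2: [{"a", "b", "c"}, {"d"}, {"d"}, {"d"}],
    3: [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"d"}],
}


def toy_evaluate(subset: List[str]) -> Evaluation:
    """Stand-in for cross-validating a multi-label classifier on `subset`."""
    return TRUE, PRED_BY_K[len(subset)]


best = incremental_feature_selection(ranked, toy_evaluate)
print(best)
```

On this toy data the two measurements prefer different prefixes: exact match is highest with the top two features, while accuracy is highest with all three. This is why a single ranked list can yield two distinct optimal classifiers, one per measurement.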