Zeng Min, Li Min, Fei Zhihui, Wu Fang-Xiang, Li Yaohang, Pan Yi, Wang Jianxin
IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):296-305. doi: 10.1109/TCBB.2019.2897679. Epub 2021 Feb 3.
Computational methods, including centrality and machine learning-based methods, have been proposed to identify essential proteins in order to understand the minimum requirements for the survival and evolution of a cell. In centrality methods, researchers must design a score function based on prior knowledge, but such a function is usually insufficient to capture the complexity of biological information. In machine learning-based methods, the selected biological features cannot fully represent the underlying biological information, because these methods lack a computational framework for selecting features automatically. To tackle these problems, we propose a deep learning framework that learns biological features automatically, without prior knowledge. We use the node2vec technique to learn a richer representation of protein-protein interaction (PPI) network topology than a score function provides. Bidirectional long short-term memory cells are applied to capture non-local relationships in gene expression data. For subcellular localization information, we use a high-dimensional indicator vector to characterize it. To evaluate the performance of our method, we tested it on the PPI network of S. cerevisiae. Our experimental results demonstrate that our method outperforms both traditional centrality methods and existing machine learning-based methods. To explore which of the three types of biological information is the most important, we conduct an ablation study by removing each component in turn. Our results show that the PPI network embedding contributes most to the improvement. In addition, gene expression profiles and subcellular localization information also help improve performance in identifying essential proteins.
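The indicator vector for subcellular localization and the remove-one-component ablation described in the abstract can be illustrated with a minimal Python sketch. The compartment names, the feature values, and the helper names `localization_indicator` and `ablate` are all hypothetical, and joining the three feature groups by simple concatenation is an assumption made for illustration; the abstract does not state how the components are combined.

```python
# Hypothetical compartment vocabulary; the abstract does not list the
# compartments actually used.
COMPARTMENTS = ["nucleus", "cytosol", "mitochondrion", "membrane"]


def localization_indicator(annotations, vocabulary=COMPARTMENTS):
    """Return a 0/1 vector with a 1 for every compartment the protein is annotated to."""
    present = set(annotations)
    return [1 if name in present else 0 for name in vocabulary]


def ablate(feature_groups, dropped):
    """Concatenate all feature groups except the one named `dropped`.

    Concatenation is an assumption of this sketch, not a detail taken
    from the abstract.
    """
    combined = []
    for name, values in feature_groups.items():
        if name != dropped:
            combined.extend(values)
    return combined


# Made-up feature values for a single protein.
protein_features = {
    "network": [0.1, 0.2],  # stand-in for a node2vec embedding
    "expression": [0.3],    # stand-in for features learned from expression data
    "localization": localization_indicator(["nucleus", "membrane"]),
}

# One reduced input per ablation run, each omitting a different component.
ablation_inputs = {name: ablate(protein_features, name) for name in protein_features}
```

In an ablation study of this kind, a model would be trained and evaluated once per reduced input, and the component whose removal causes the largest drop in performance is taken to contribute most.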