Wang Ge, Xue Min-Qi, Shen Hong-Bin, Xu Ying-Ying
School of Biomedical Engineering and Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China.
Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China.
Brief Bioinform. 2022 Mar 10;23(2). doi: 10.1093/bib/bbab539.
Location proteomics seeks to provide automated high-resolution descriptions of protein location patterns within cells. Many efforts have been undertaken in location proteomics over the past decades, producing a large number of automated predictors for protein subcellular localization. However, most of these predictors are trained on either high-throughput microscopic images or protein amino acid sequences alone. The integration of heterogeneous protein data sources has yet to be exploited. In this paper, we present a pipeline called sequence, image, network-based protein subcellular locator (SIN-Locator), which constructs a multi-view description of proteins by integrating multiple data types, including images of protein expression in cells or tissues, amino acid sequences and protein-protein interaction networks, to classify the patterns of protein subcellular locations. Proteins were encoded by both handcrafted features and deep learning features, and multiple combining methods were implemented. Our experimental results indicated that optimal integrations can considerably enhance the classification accuracy, and the utility of SIN-Locator was demonstrated by applying it to newly released proteins in the Human Protein Atlas. Furthermore, we investigated the contribution of different data sources and the influence of partially missing data. This work is anticipated to provide clues for the reconciliation and combination of multi-source data for protein location analysis.
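The abstract mentions that multiple combining methods were implemented but does not specify them. As a purely illustrative sketch, and not the SIN-Locator implementation, the two most common families of multi-view combination can be written as follows: feature-level (early) fusion, which concatenates per-view feature vectors, and decision-level (late) fusion, which averages per-view class probabilities. The function names, array shapes, and the convention of marking a missing view with NaN rows are all assumptions made for this example.

```python
import numpy as np


def zscore(x, eps=1e-8):
    """Standardize each feature column to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


def early_fusion(views):
    """Feature-level fusion (hypothetical helper).

    Each view is an array of shape (n_proteins, n_features_of_view).
    Views are standardized separately, then concatenated column-wise.
    """
    return np.concatenate([zscore(v) for v in views], axis=1)


def late_fusion(prob_views):
    """Decision-level fusion (hypothetical helper).

    Each element is an array of shape (n_proteins, n_classes) holding
    class probabilities from one view-specific classifier. A view that
    is unavailable for a protein is marked by a row of NaN (an assumed
    convention) and is skipped when averaging for that protein.
    """
    stacked = np.stack(prob_views)  # (n_views, n_proteins, n_classes)
    return np.nanmean(stacked, axis=0)


# Toy data standing in for image, sequence, and network features.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 5))
seq_feats = rng.normal(size=(4, 3))
ppi_feats = rng.normal(size=(4, 2))
fused_feats = early_fusion([image_feats, seq_feats, ppi_feats])

# Toy per-view probabilities; the second protein lacks a sequence view.
p_image = np.array([[0.8, 0.2], [0.4, 0.6]])
p_seq = np.array([[0.6, 0.4], [np.nan, np.nan]])
fused_probs = late_fusion([p_image, p_seq])
```

In this sketch the late-fusion path tolerates a partially missing view because each protein is averaged only over the views it has, whereas the early-fusion path would need imputation or a view-specific fallback before concatenation.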