Xu Ying-Ying, Zhou Hang, Murphy Robert F, Shen Hong-Bin
School of Biomedical Engineering, Southern Medical University, Guangzhou, China.
Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China.
Proteins. 2021 Feb;89(2):242-250. doi: 10.1002/prot.26010. Epub 2020 Sep 26.
A major challenge for protein databases is reconciling information from diverse sources. This is especially difficult when some of the information consists of secondary, human-interpreted data rather than primary data. For example, the Swiss-Prot database contains curated annotations of subcellular location that are based on predictions from protein sequence, statements in scientific articles, and published experimental evidence. The Human Protein Atlas (HPA) consists of millions of high-resolution microscopic images that show the spatial distribution of proteins at the cellular and subcellular levels. These images are manually annotated with protein subcellular locations by trained experts, and the annotations can capture variation in subcellular location across different cell lines, tissues, or tissue states. A systematic investigation of the consistency between HPA and Swiss-Prot assignments of subcellular location, which is important for understanding and using protein location data from the two databases, has not been described previously. In this paper, we quantitatively evaluate the consistency of subcellular location annotations between HPA and Swiss-Prot at multiple levels, as well as the variation of protein locations across cell lines and tissues. Our results show that the annotations in these two databases differ significantly in many cases, which leads us to propose procedures for deriving and integrating protein subcellular location data. We also find that proteins with highly variable locations are more likely to be disease biomarkers, which supports incorporating analysis of subcellular location into protein biomarker identification and screening.
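The abstract does not state which consistency measure is used. As a minimal sketch of one common way to quantify agreement between two sets of location labels, the example below computes a per-protein Jaccard similarity. The function names, protein identifiers, and location terms are all hypothetical toy values, not data or methods from the paper, and the sketch assumes both databases' terms have already been mapped to a shared vocabulary.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two label sets; defined here as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency(
    hpa: Dict[str, Set[str]], swissprot: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Score only the proteins annotated in both sources."""
    shared = hpa.keys() & swissprot.keys()
    return {protein: jaccard(hpa[protein], swissprot[protein]) for protein in shared}


# Made-up toy annotations (assumed to use one harmonized vocabulary).
hpa = {
    "P1": {"nucleus", "cytosol"},
    "P2": {"mitochondria"},
    "P3": {"golgi apparatus"},
}
swissprot = {
    "P1": {"nucleus"},
    "P2": {"mitochondria"},
    "P4": {"plasma membrane"},
}

scores = consistency(hpa, swissprot)
for protein in sorted(scores):
    print(protein, scores[protein])
```

A score of 1.0 means the two label sets are identical and 0.0 means they share no label. Proteins present in only one source (P3 and P4 here) are left out of the comparison rather than scored as disagreements, which is a design choice of this sketch.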