Casado-Vela Juan, Matthiesen Rune, Sellés Susana, Naranjo José Ramón
Spanish National Research Council (CSIC) - Spanish National Biotechnology Centre (CNB), Darwin 3, Cantoblanco, 28049 Madrid, Spain.
Institute of Molecular Pathology and Immunology (IPATIMUP), University of Porto, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal.
Proteomes. 2013 May 31;1(1):3-24. doi: 10.3390/proteomes1010003.
Understanding protein interaction networks and their dynamic changes is a major challenge in modern biology. Currently, several experimental and computational approaches allow large-scale screening of protein interactors. As a result, the volume of protein interaction data deposited in databases and published in the peer-reviewed literature is constantly growing. Multiple databases, accessible through user-friendly web tools, have recently emerged to facilitate the retrieval and integration of protein interaction data. Nevertheless, as we show in this report, despite current efforts towards data integration, the information on protein interactions retrieved by computational approaches is frequently incomplete and may even list false interactions. Here we point to some obstacles precluding confident data integration, with special emphasis on protein interactions, which include gene acronym redundancies and protein synonyms. Three human proteins (choline kinase, PPIase and uromodulin) and three web-based search engines focused on protein interaction data retrieval (PSICQUIC, DASMI and BIPS) were used to illustrate the errors that can occur and that researchers in the field should take into account. We demonstrate that, despite recent initiatives towards data standardization, manual curation of protein interaction networks based on literature searches is still required to remove potential false positives. A three-step workflow consisting of (i) data retrieval from multiple databases, (ii) peer-reviewed literature searches, and (iii) data curation and integration is proposed as the best strategy to gather updated information on protein interactions. Finally, this strategy was applied to compile information on the human DREAM protein interactome, which constitutes a reliable training dataset that can be used to improve computational predictions.
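The integration step of the three-step workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout, the function name `integrate`, and every symbol, accession and database name below are hypothetical placeholders. The sketch assumes that each retrieved record carries both a gene acronym and a protein accession, merges records from several sources, and flags any acronym that maps to more than one accession so that it can be checked manually against the literature.

```python
from collections import defaultdict

# Hypothetical records: (source, query_symbol, interactor_symbol, interactor_accession).
# All names and accessions are placeholders, not real database entries.
RECORDS = [
    ("db_one", "GENE1", "ABC", "ACC001"),
    ("db_two", "GENE1", "ABC", "ACC001"),
    ("db_two", "GENE1", "ABC", "ACC002"),  # same acronym, different protein
    ("db_three", "GENE1", "XYZ", "ACC003"),
]


def integrate(records):
    """Merge records across sources and flag redundant gene acronyms."""
    sources = defaultdict(set)               # interaction -> supporting sources
    accessions_by_symbol = defaultdict(set)  # acronym -> accessions seen for it
    for source, query, symbol, accession in records:
        sources[(query, symbol, accession)].add(source)
        accessions_by_symbol[symbol].add(accession)

    merged = []
    for (query, symbol, accession), found_in in sorted(sources.items()):
        merged.append({
            "query": query,
            "interactor": symbol,
            "accession": accession,
            "sources": sorted(found_in),
            # An acronym shared by several accessions needs manual curation.
            "ambiguous_symbol": len(accessions_by_symbol[symbol]) > 1,
        })
    return merged


if __name__ == "__main__":
    for row in integrate(RECORDS):
        print(row)
```

Keying interactions on the accession rather than on the acronym keeps two proteins that share an acronym apart, while the `ambiguous_symbol` flag marks the entries where a literature search is needed before the interaction is accepted.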