Tieri Paolo, Nardini Christine
Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yue Yang Road 320, Shanghai, P. R. China.
Mol Biosyst. 2013 Oct;9(10):2401-7. doi: 10.1039/c3mb70242a.
Issues and limitations related to accessibility, understandability and ease of use of signalling pathway databases may hamper or divert research workflow, leading, in the worst case, to the generation of confusing reference frameworks and misinterpretation of experimental results. In an attempt to retrieve signalling pathway data related to a specific set of test genes, we queried and analysed the results from six of the major curated signalling pathway databases: Reactome, PathwayCommons, KEGG, InnateDB, PID, and Wikipathways.
Although we expected differences (often a desirable feature for the integration of each individual query), we observed variations of exceptional magnitude, with disproportionate quality and quantity of the results. Some of the more remarkable differences can be explained by the diverse conceptual designs and purposes of the databases, the types of data stored and the structure of the query, as well as by missing or erroneous descriptions of the search procedure. To go beyond the mere enumeration of these problems, we identified a number of operational features, in particular inner and cross coherence, which, once quantified, offer objective criteria to choose the best source of information.
In silico biology relies heavily on the information stored in databases. To ensure that computational biology mirrors biological reality and offers focused hypotheses to be experimentally validated, coherence of data codification is crucial and yet highly underestimated. We make practical recommendations for the end-user to cope with the current state of the databases as well as for the maintainers of those databases to contribute to the goal of the full enactment of the open data paradigm.
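As a rough illustration of what quantifying cross coherence between databases could look like, the sketch below scores the pairwise agreement of the gene sets that different databases return for the same query. The abstract does not define the metric used, so the Jaccard index here is a stand-in assumption, and the function names (`jaccard`, `cross_coherence`), the database labels and the gene symbols are hypothetical placeholders rather than data or methods from the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections: |A & B| / |A | B| (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cross_coherence(results):
    """Pairwise Jaccard agreement between databases.

    `results` maps a database label to the gene identifiers it returned
    for one and the same query. Assumption: Jaccard overlap is used only
    as an illustrative proxy for cross coherence.
    """
    return {
        (x, y): jaccard(results[x], results[y])
        for x, y in combinations(sorted(results), 2)
    }


# Made-up placeholder results, not taken from any real database.
toy_results = {
    "DB_A": {"GENE1", "GENE2", "GENE3"},
    "DB_B": {"GENE2", "GENE3", "GENE4"},
    "DB_C": {"GENE5"},
}

scores = cross_coherence(toy_results)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

A score near 1 means two sources return nearly the same entities for the query, while a score near 0 flags the kind of large discrepancy described above. Identifiers would need to be mapped to a common namespace before any such comparison is meaningful.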