Melkonian Marc, Juigné Camille, Dameron Olivier, Rabut Gwenaël, Becker Emmanuelle
Univ Rennes, Inria, CNRS, IRISA - UMR 6074, F-35000 Rennes, France.
Univ Rennes, CNRS, IGDR - UMR 6290, F-35000 Rennes, France.
Bioinformatics. 2022 Mar 4;38(6):1685-1691. doi: 10.1093/bioinformatics/btac013.
Information on protein-protein interactions is collected in numerous primary databases with their own curation process. Several meta-databases aggregate primary databases to provide more exhaustive datasets. In addition to exhaustivity, aggregation contributes to reliability by providing an overview of the various studies and detection methods supporting an interaction. However, interactions listed in different primary databases are partly redundant because some publications reporting protein-protein interactions have been curated by multiple primary databases. Mere aggregation can thus introduce a bias if these redundancies are not identified and eliminated. To overcome this bias, meta-databases rely on the Molecular Interaction ontology that describes interaction detection methods, but they do not fully take advantage of the ontology's rich semantics, which leads to systematically overestimating interaction reproducibility.
We propose a precise definition of explicit and implicit redundancy and show that both can be easily detected using Semantic Web technologies. We apply this process to a dataset from the Agile Protein Interactomes DataServer (APID) meta-database and show that while explicit redundancies were detected by the APID aggregation process, about 15% of APID entries are implicitly redundant and should not be taken into account when presenting confidence-related metrics. More than 90% of implicit redundancies result from the aggregation of distinct primary databases, whereas the remainder occur between entries of a single database. Finally, we build a 'reproducible interactome' with interactions that have been reproduced by multiple methods or publications. The size of the reproducible interactome is drastically reduced by removing redundancies for both yeast (-59%) and human (-56%), and we show that this is largely due to implicit redundancies.
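The distinction between the two kinds of redundancy can be illustrated with a minimal Python sketch. It is not the authors' Semantic Web pipeline, and the abstract does not spell out the formal definitions, so the sketch rests on stated assumptions: two entries are treated as explicitly redundant when they report the same protein pair, publication and detection method, and as implicitly redundant when they share the pair and publication but one method is an ancestor of the other in the method hierarchy. The `PARENT` table is a hand-written toy fragment, and all function names, protein names and database labels are hypothetical.

```python
from collections import defaultdict

# Toy child -> parent fragment of a detection-method hierarchy.
# Assumption: written by hand for illustration, not loaded from the real ontology.
PARENT = {
    "two hybrid array": "two hybrid",
    "two hybrid": "experimental interaction detection",
    "pull down": "experimental interaction detection",
}


def ancestors(method):
    """Return every term above `method` in the toy hierarchy."""
    found = set()
    while method in PARENT:
        method = PARENT[method]
        found.add(method)
    return found


def remove_redundancy(entries):
    """Collapse explicit and implicit redundancies (as assumed above).

    entries: iterable of (protein_a, protein_b, publication_id, method, source_db).
    Returns a set of (pair, publication_id, method) evidence items.
    """
    methods_by_key = defaultdict(set)
    for a, b, pub, method, _db in entries:
        pair = tuple(sorted((a, b)))          # interactions are unordered pairs
        methods_by_key[(pair, pub)].add(method)  # the set collapses explicit duplicates

    kept = set()
    for (pair, pub), methods in methods_by_key.items():
        for m in methods:
            # Drop m if a more specific method from the same publication is present.
            generalises_other = any(m in ancestors(o) for o in methods if o != m)
            if not generalises_other:
                kept.add((pair, pub, m))
    return kept


def reproducible_pairs(evidence):
    """Pairs supported by more than one publication or more than one method."""
    pubs, methods = defaultdict(set), defaultdict(set)
    for pair, pub, m in evidence:
        pubs[pair].add(pub)
        methods[pair].add(m)
    return {p for p in pubs if len(pubs[p]) > 1 or len(methods[p]) > 1}
```

In this sketch, a pair reported once as "two hybrid" and once as "two hybrid array" from the same publication would naively count as supported by two methods; after removing the implicit redundancy it has a single piece of evidence and drops out of the reproducible set.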
Software, data and results are available at https://gitlab.com/nnet56/reproducible-interactome, https://reproducible-interactome.genouest.org/, Zenodo (https://doi.org/10.5281/zenodo.5595037) and NDEx (https://doi.org/10.18119/N94302 and https://doi.org/10.18119/N97S4D).
Supplementary data are available at Bioinformatics online.