B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain.
Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, Esplugues de Llobregat, Barcelona, 08950, Spain.
Bioinformatics. 2021 May 5;37(6):845-852. doi: 10.1093/bioinformatics/btaa896.
Network diffusion and label propagation are fundamental tools in computational biology, with applications such as gene-disease association, protein function prediction and module discovery. More recently, several publications have introduced a permutation analysis after the propagation process, due to concerns that network topology can bias diffusion scores. This raises the question of the statistical properties of such diffusion processes, and of the presence of bias, in each of their applications. In this work, we characterized some common null models behind the permutation analysis and the statistical properties of the diffusion scores. We benchmarked seven diffusion scores on three case studies: synthetic signals on a yeast interactome, simulated differential gene expression on a protein-protein interaction network and prospective gene set prediction on another interaction network. For clarity, all the datasets were based on binary labels, but we also present theoretical results for quantitative labels.
Diffusion scores starting from binary labels were affected by the label codification and exhibited a problem-dependent topological bias that could be removed by statistical normalization. Parametric and non-parametric normalization addressed both points by being codification-independent and by equalizing the bias. We identified and quantified two sources of bias, mean value and variance, that yielded performance differences when normalizing the scores. We provided closed formulae for both and showed how the null covariance is related to the spectral properties of the graph. Although none of the proposed scores systematically outperformed the others, normalization was preferred when the sought positive labels were not aligned with the bias. We conclude that the decision on bias removal should be problem- and data-driven, i.e. based on a quantitative analysis of the bias and its relation to the positive entities.
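The idea of normalizing diffusion scores against a permutation null can be illustrated with a small sketch. It is not the authors' implementation: the toy graph, the choice of the regularized Laplacian kernel, the function names and the assumption that every node carries a label are all ours, made only for illustration. The null mean and variance use the standard moments of a linear permutation statistic, and an exhaustive enumeration of label permutations is included as a sanity check.

```python
import itertools
import numpy as np


def regularized_laplacian_kernel(adj):
    """Return K = (I + L)^-1, with L the unnormalized graph Laplacian (assumed kernel)."""
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.inv(np.eye(adj.shape[0]) + lap)


def null_moments(kernel, labels):
    """Mean and variance of each node's score when labels are permuted uniformly."""
    n = labels.size
    row_mean = kernel.mean(axis=1)
    mean = n * row_mean * labels.mean()
    row_ss = ((kernel - row_mean[:, None]) ** 2).sum(axis=1)
    label_ss = ((labels - labels.mean()) ** 2).sum()
    var = row_ss * label_ss / (n - 1)
    return mean, var


def z_scores(kernel, labels):
    """Parametric normalization: (raw score - null mean) / null standard deviation."""
    mean, var = null_moments(kernel, labels)
    return (kernel @ labels - mean) / np.sqrt(var)


# Hypothetical toy graph: node 0 is a small hub, nodes 3-4-5 form a tail.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)]:
    adj[i, j] = adj[j, i] = 1.0

labels = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 1.0])  # binary input labels
kernel = regularized_laplacian_kernel(adj)
raw = kernel @ labels           # unnormalized diffusion scores
z = z_scores(kernel, labels)    # normalized scores

# Sanity check: enumerate all 720 label permutations and compare moments.
perm_scores = np.array(
    [kernel @ labels[list(p)] for p in itertools.permutations(range(6))]
)
null_mean, null_var = null_moments(kernel, labels)
print(np.allclose(perm_scores.mean(axis=0), null_mean))
print(np.allclose(perm_scores.var(axis=0), null_var))

# Recoding the labels as {-1, 1} changes the raw scores but not the z-scores.
print(np.allclose(z_scores(kernel, 2 * labels - 1), z))
```

In this fully labelled toy setting the rows of the kernel sum to one, so the null mean is identical for every node and only the null variance differs between nodes. The last check shows the codification independence of the normalized scores in this simplified setup.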
The code is publicly available at https://github.com/b2slab/diffuBench and the data underlying this article are available at https://github.com/b2slab/retroData.
Supplementary data are available at Bioinformatics online.