CRIBI Biotechnology Center, University of Padova, viale G. Colombo, 3, Padova, Italy.
Department of Women's and Children's Health, University of Padova, via Giustiniani, 3, Padova, Italy.
BMC Bioinformatics. 2018 Jan 25;19(1):23. doi: 10.1186/s12859-018-2025-5.
Uncovering genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Computational methods for prioritizing candidate genes can help to cope with these problems. In particular, kernel-based methods are a powerful resource for integrating heterogeneous biological knowledge; however, their practical use is often precluded by their limited scalability.
We propose Scuba, a scalable kernel-based method for gene prioritization. It implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimization of the margin distribution. Scuba is optimized to cope with strongly unbalanced settings where known disease genes are few and large-scale predictions are required. Importantly, it can deal efficiently both with a large number of candidate genes and with an arbitrary number of data sources. As a direct consequence of its scalability, Scuba also integrates a new, efficient strategy to select optimal kernel parameters for each data source. We performed cross-validation experiments and simulated a realistic usage setting, showing that Scuba outperforms a wide range of state-of-the-art methods.
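The general idea of kernel-based prioritization over several data sources can be illustrated with a minimal sketch. It is not Scuba's algorithm: the kernels are combined with fixed uniform weights rather than weights learned by optimizing the margin distribution, and candidates are scored by their mean kernel similarity to the known disease genes. The function names (`rbf_kernel`, `combine_kernels`, `rank_candidates`), the choice of an RBF kernel, and the `gamma` values are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix for the rows of X (one row per gene)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip small negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def combine_kernels(kernels, weights=None):
    """Weighted sum of kernel matrices (uniform weights by default).

    A multiple kernel learning method would learn these weights from data;
    here they are fixed, which is a simplification.
    """
    kernels = [np.asarray(K, dtype=float) for K in kernels]
    if weights is None:
        weights = np.full(len(kernels), 1.0 / len(kernels))
    return sum(w * K for w, K in zip(weights, kernels))


def rank_candidates(K, seed_idx):
    """Rank non-seed genes by mean kernel similarity to the seed genes.

    Returns the candidate indices sorted from most to least similar,
    together with the score of every gene.
    """
    seed_idx = np.asarray(seed_idx)
    scores = K[:, seed_idx].mean(axis=1)
    candidates = np.setdiff1d(np.arange(K.shape[0]), seed_idx)
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores


if __name__ == "__main__":
    # Toy data: two hypothetical data sources describing the same six genes.
    source_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                         [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
    source_b = np.array([[1.0], [1.1], [0.9], [-3.0], [-3.2], [-2.9]])

    K = combine_kernels([rbf_kernel(source_a, gamma=0.5),
                         rbf_kernel(source_b, gamma=0.5)])
    ranking, scores = rank_candidates(K, seed_idx=[0, 1])
    print("candidate ranking:", ranking)
```

In this sketch each data source contributes one kernel with a hand-picked `gamma`; selecting that parameter per source is the step for which the abstract says Scuba provides a dedicated strategy.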
Scuba achieves state-of-the-art performance and scales better than existing kernel-based approaches for genomic data. This method can be useful to prioritize candidate genes, particularly when their number is large or when input data are highly heterogeneous. The code is freely available at https://github.com/gzampieri/Scuba.