Čopar Andrej, Žitnik Marinka, Zupan Blaž
Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia.
Department of Computer Science, Stanford University, Stanford, 94305 CA USA.
BioData Min. 2017 Dec 29;10:41. doi: 10.1186/s13040-017-0160-6. eCollection 2017.
Matrix factorization is a well-established pattern discovery tool that has seen numerous applications in biomedical data analytics, such as gene expression co-clustering, patient stratification, and gene-disease association mining. Matrix factorization learns a latent data model that takes a data matrix and transforms it into a latent feature space, enabling generalization, noise removal, and feature discovery. However, factorization algorithms are numerically intensive, and hence there is a pressing challenge to scale current algorithms to work with large datasets. Our focus in this paper is matrix tri-factorization, a popular method that is not limited by the assumption of standard matrix factorization that the data reside in one latent space. Matrix tri-factorization instead infers a separate latent space for each dimension of a data matrix, together with a latent mapping of interactions between the inferred spaces, making the approach particularly suitable for biomedical data mining.
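To make the setting concrete, non-negative matrix tri-factorization approximates a data matrix X by a product U S Vᵀ, where U and V span the latent spaces of the two matrix dimensions and S maps interactions between them. A minimal NumPy sketch using standard multiplicative updates follows; the function name, ranks, and iteration count are illustrative, and this is a generic update scheme rather than the authors' exact algorithm:

```python
import numpy as np

def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Minimal non-negative matrix tri-factorization X ~= U @ S @ V.T
    via multiplicative updates (a standard scheme, shown for illustration)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Non-negative random initialization of the two latent spaces and the mapping.
    U = rng.random((n, k1))
    S = rng.random((k1, k2))
    V = rng.random((m, k2))
    for _ in range(n_iter):
        # Each factor is scaled element-wise by the ratio of the gradient terms;
        # eps guards against division by zero.
        U *= (X @ V @ S.T) / (U @ S @ (V.T @ V) @ S.T + eps)
        S *= (U.T @ X @ V) / ((U.T @ U) @ S @ (V.T @ V) + eps)
        V *= (X.T @ U @ S) / (V @ S.T @ (U.T @ U) @ S + eps)
    return U, S, V

# Toy example: factorize a small random non-negative matrix.
X = np.random.default_rng(1).random((20, 15))
U, S, V = nmtf(X, k1=4, k2=3)
err = np.linalg.norm(X - U @ S @ V.T) / np.linalg.norm(X)
```

The updates keep all factors non-negative by construction, which is what makes the recovered latent features interpretable as additive parts.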
We developed a block-wise approach for latent factor learning in matrix tri-factorization. The approach partitions a data matrix into disjoint submatrices that are treated independently and fed into a parallel factorization system. An appealing property of the proposed approach is its mathematical equivalence with serial matrix tri-factorization. In a study on large biomedical datasets, we show that our approach scales well on multi-processor and multi-GPU architectures. On a four-GPU system, we demonstrate that our approach can be more than 100 times faster than its single-processor counterpart.
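The equivalence with serial factorization can be illustrated on the matrix products that dominate the multiplicative updates: when the data matrix is split into disjoint submatrices, each row block of a product such as X V is a sum of per-block contributions, so the blocks can be processed independently (e.g., one per GPU) and combined without changing the result. A toy NumPy sketch, with illustrative block sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 6))
V = rng.random((6, 3))

# Partition X into a 2x2 grid of disjoint submatrices,
# and V into matching row blocks.
row_splits, col_splits = [4], [3]
X_blocks = [np.hsplit(part, col_splits) for part in np.vsplit(X, row_splits)]
V_blocks = np.vsplit(V, col_splits)

# Block-wise product: row block i of X @ V is the sum over column blocks j
# of X_blocks[i][j] @ V_blocks[j]; each term can be computed independently.
XV_blockwise = np.vstack([
    sum(X_blocks[i][j] @ V_blocks[j] for j in range(2))
    for i in range(2)
])

# The combined block-wise result matches the serial product exactly.
assert np.allclose(XV_blockwise, X @ V)
```

Because every update term decomposes this way, the parallel block-wise scheme reproduces the serial iterates exactly rather than approximating them.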
A general approach for scaling non-negative matrix tri-factorization is proposed. The approach is especially useful for parallel matrix factorization implemented in a multi-GPU environment. We expect the new approach will be useful in emerging procedures for latent factor analysis, notably for data integration, where many large data matrices need to be collectively factorized.