IEEE Trans Pattern Anal Mach Intell. 2015 Jan;37(1):41-53. doi: 10.1109/TPAMI.2014.2343973.
For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system's constraints. In this paper we describe a data fusion approach with penalized matrix tri-factorization (DFMF) that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including data from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF on a gene function prediction task with eleven different data sources and on the prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
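To make the idea of simultaneous tri-factorization concrete, the following is a minimal sketch and not the DFMF algorithm itself. It assumes two relation matrices, R12 (objects of type 1 versus type 2) and R13 (type 1 versus type 3), and approximates each as R_ij ≈ G_i S_ij G_j^T, with the factor G1 shared between both relations so that the two data sources inform each other. The function name `fuse_two_relations`, the choice of alternating least squares, and all parameters are assumptions made for illustration; the penalty terms and constraint handling that the abstract refers to are omitted.

```python
import numpy as np


def fuse_two_relations(R12, R13, k1, k2, k3, n_iter=50, seed=0):
    """Jointly tri-factorize R12 ~ G1 S12 G2^T and R13 ~ G1 S13 G3^T.

    Hypothetical helper: unpenalized alternating least squares,
    used here only to illustrate a shared factor (G1).
    """
    rng = np.random.default_rng(seed)
    n1, n2 = R12.shape
    n3 = R13.shape[1]

    # Random initial factors; k1, k2, k3 are the latent ranks per object type.
    G1 = rng.standard_normal((n1, k1))
    G2 = rng.standard_normal((n2, k2))
    G3 = rng.standard_normal((n3, k3))
    S12 = rng.standard_normal((k1, k2))
    S13 = rng.standard_normal((k1, k3))

    def loss():
        # Sum of squared Frobenius reconstruction errors over both relations.
        return (np.linalg.norm(R12 - G1 @ S12 @ G2.T) ** 2
                + np.linalg.norm(R13 - G1 @ S13 @ G3.T) ** 2)

    history = [loss()]
    for _ in range(n_iter):
        # Least-squares update of each middle factor with the G's fixed.
        S12 = np.linalg.pinv(G1) @ R12 @ np.linalg.pinv(G2).T
        S13 = np.linalg.pinv(G1) @ R13 @ np.linalg.pinv(G3).T

        # Least-squares update of the factors that appear in one relation only.
        G2 = (np.linalg.pinv(G1 @ S12) @ R12).T
        G3 = (np.linalg.pinv(G1 @ S13) @ R13).T

        # Shared factor: fit G1 against both relations stacked side by side.
        B = np.hstack([S12 @ G2.T, S13 @ G3.T])      # shape (k1, n2 + n3)
        G1 = np.hstack([R12, R13]) @ np.linalg.pinv(B)

        history.append(loss())

    return G1, (G2, G3), (S12, S13), history
```

In this sketch, unobserved or hidden associations would be read off the reconstructed matrix `G1 @ S12 @ G2.T`. Because G1 is fitted against both relations at once, information in R13 can influence the reconstruction of R12, which is the intuition behind fusing several sources around one target relation.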