Vitali F, Marini S, Pala D, Demartini A, Montoli S, Zambelli A, Bellazzi R
Center for Biomedical Informatics and Biostatistics, The University of Arizona, Tucson, Arizona, USA.
BIO5 Institute, The University of Arizona, Tucson, Arizona, USA.
JAMIA Open. 2018 May 14;1(1):75-86. doi: 10.1093/jamiaopen/ooy008. eCollection 2018 Jul.
Computing patient similarity is of great interest in precision oncology because it supports clustering and subgroup identification, eventually leading to tailored therapies. The availability of large amounts of biomedical data, characterized by large feature sets and sparse content, motivates the development of new methods for computing patient similarities that can fuse heterogeneous data sources with the available knowledge.
In this work, we developed a data integration approach based on matrix trifactorization to compute patient similarities by integrating several sources of data and knowledge. We assess the accuracy of the proposed method (1) on several synthetic data sets whose similarity structures are affected by increasing levels of noise and data sparsity, and (2) on a real data set from an acute myeloid leukemia (AML) study. The results are then compared with those of traditional similarity calculation methods.
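As an illustration of the general idea only, the following is a minimal sketch of nonnegative matrix trifactorization on a single patient-by-feature matrix, followed by a cosine similarity computed on the patient factor. It is not the authors' implementation: the method described above fuses several data and knowledge sources in a joint model, whereas this sketch factorizes one matrix with standard multiplicative updates. The function names (`tri_factorize`, `patient_similarity`), the ranks `k1` and `k2`, and the choice of cosine similarity are all assumptions made for the example.

```python
import numpy as np


def tri_factorize(R, k1, k2, n_iter=200, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix R (patients x features) as F @ S @ G.T.

    Hypothetical helper: F is patients x k1, S is k1 x k2, G is features x k2.
    Uses plain multiplicative updates that keep all factors nonnegative.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    F = rng.random((n, k1)) + eps
    S = rng.random((k1, k2)) + eps
    G = rng.random((m, k2)) + eps
    for _ in range(n_iter):
        # Each update rescales one factor while the other two are held fixed.
        F *= (R @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (R.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ R @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G


def patient_similarity(F):
    """Cosine similarity between patients in the latent space spanned by F."""
    norms = np.maximum(np.linalg.norm(F, axis=1, keepdims=True), 1e-12)
    unit = F / norms
    return unit @ unit.T


# Toy usage on random nonnegative data (illustrative values, not study data).
rng = np.random.default_rng(1)
R = rng.random((20, 5)) @ rng.random((5, 15))
F, S, G = tri_factorize(R, k1=3, k2=3)
similarity = patient_similarity(F)
```

The resulting `similarity` matrix could then be passed to any clustering routine; which one is appropriate depends on the study and is not specified here.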
In the analysis of the synthetic data sets, where the ground truth is known, we measured the ability to reconstruct the correct clusters, while in the AML study we evaluated the Kaplan-Meier curves obtained for the different clusters and measured their statistical difference by means of the log-rank test. In the presence of noise and sparse data, our data integration method outperforms other techniques on both the synthetic and the AML data.
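To make the survival comparison concrete, here is a minimal sketch of a two-group log-rank test written with NumPy and the standard library. It is an illustration under stated assumptions, not the analysis code of the study: the function name `logrank_test`, the restriction to two groups, and the toy survival times are all hypothetical, and a comparison across more than two clusters would need the multi-group form of the test.

```python
import math

import numpy as np


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times_*  : follow-up times for each patient in the group
    events_* : 1 if the event (e.g. death) was observed, 0 if censored
    """
    times = np.concatenate([times_a, times_b]).astype(float)
    events = np.concatenate([events_a, events_b]).astype(bool)
    in_a = np.concatenate(
        [np.ones(len(times_a), dtype=bool), np.zeros(len(times_b), dtype=bool)]
    )
    observed_a = expected_a = variance = 0.0
    for t in np.unique(times[events]):  # distinct event times only
        at_risk = times >= t
        n = at_risk.sum()
        n_a = (at_risk & in_a).sum()
        died = events & (times == t)
        d = died.sum()
        observed_a += (died & in_a).sum()
        expected_a += d * n_a / n  # expected events in group A under the null
        if n > 1:
            # Hypergeometric variance of the group A event count at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    stat = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return float(stat), p_value


# Toy usage: two hypothetical clusters with clearly different survival times.
stat, p_value = logrank_test(
    [1, 2, 3, 4, 5], [1, 1, 1, 1, 1], [10, 11, 12, 13, 14], [1, 1, 1, 1, 1]
)
```

A small p-value indicates that the Kaplan-Meier curves of the two groups differ more than would be expected by chance.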
When multiple heterogeneous data sources are available, a matrix trifactorization technique can successfully fuse all the information in a joint model. We demonstrated how this approach can be efficiently applied to discover meaningful patient similarities, and it may therefore be considered a reliable data-driven strategy for defining new research hypotheses in precision oncology.
The improved performance of the proposed approach is an advantage over previous methods, providing accurate patient similarities in support of precision medicine.