Hung Ethan C, Hodzic Enio, Tan Zhixin Cyrillus, Meyer Aaron S
Computational and Systems Biology, University of California, Los Angeles (UCLA), USA.
Department of Bioengineering, UCLA, USA.
bioRxiv. 2024 Jul 10:2024.07.05.602272. doi: 10.1101/2024.07.05.602272.
Tensor factorization is a dimensionality reduction method applied to multidimensional arrays. These methods are useful for identifying patterns within a variety of biomedical datasets because they preserve the organizational structure of experiments and therefore aid in generating meaningful insights. However, missing data in the datasets being analyzed can pose challenges. Tensor factorization can be performed with some level of missing data and can still reconstruct a complete tensor. Although tensor methods can impute these missing values, the choice of fitting algorithm may influence the fidelity of the imputations. Previous approaches, based on alternating least squares with prefilled values or direct optimization, suffer from introduced bias or slow computational performance. In this study, we propose that censored least squares can better handle missing values in data structured in tensor form. We ran censored least squares on four different biological datasets and compared its performance against alternating least squares with prefilled values and direct optimization. We used the error of imputation and the ability to infer masked values to benchmark each method's performance on missing data. Censored least squares appeared best suited for the analysis of high-dimensional biological data by accuracy and convergence metrics across several studies.
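To make the contrast with prefilling concrete, the following is a minimal sketch of a censored alternating least squares update for a three-way CP (canonical polyadic) factorization. It is not the authors' implementation. The function names (`khatri_rao`, `censored_als`, `reconstruct`), the use of NaN to mark missing entries, the random initialization, and the fixed iteration count are all assumptions made for illustration. The idea shown is that each row of a factor matrix is solved using only the entries that were actually observed, so no placeholder values enter the fit.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product; the row index of b varies fastest."""
    rank = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, rank)


def reconstruct(factors):
    """Rebuild the full three-way tensor from its CP factor matrices."""
    return np.einsum("ir,jr,kr->ijk", *factors)


def censored_als(tensor, rank, n_iter=200, seed=0):
    """Hypothetical censored ALS for a 3-way tensor; NaN marks missing data."""
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(tensor)  # True where a value was observed
    factors = [rng.standard_normal((dim, rank)) for dim in tensor.shape]

    for _ in range(n_iter):
        for mode in range(3):
            # Design matrix built from the two factor matrices held fixed.
            others = [factors[m] for m in range(3) if m != mode]
            design = khatri_rao(others[0], others[1])

            # Unfold data and mask so that rows index the current mode.
            n_rows = tensor.shape[mode]
            unfolded = np.moveaxis(tensor, mode, 0).reshape(n_rows, -1)
            observed = np.moveaxis(mask, mode, 0).reshape(n_rows, -1)

            for i in range(n_rows):
                cols = observed[i]
                if cols.any():
                    # Censoring step: drop unobserved columns from the
                    # least squares problem instead of prefilling them.
                    factors[mode][i] = np.linalg.lstsq(
                        design[cols], unfolded[i, cols], rcond=None
                    )[0]
    return factors
```

Under these assumptions, imputation amounts to calling `reconstruct(censored_als(data, rank))` and reading off the entries that were NaN in `data`. Masking a subset of known values before fitting and comparing them with the reconstruction gives a simple version of the masked-value benchmark described above.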