Censored Least Squares for Imputing Missing Values in PARAFAC Tensor Factorization.

Author information

Hung Ethan C, Hodzic Enio, Tan Zhixin Cyrillus, Meyer Aaron S

Affiliations

Computational and Systems Biology, University of California, Los Angeles (UCLA), USA.

Department of Bioengineering, UCLA, USA.

Publication information

bioRxiv. 2024 Jul 10:2024.07.05.602272. doi: 10.1101/2024.07.05.602272.

Abstract

Tensor factorization is a dimensionality reduction method applied to multidimensional arrays. These methods are useful for identifying patterns within a variety of biomedical datasets because they preserve the organizational structure of experiments and therefore aid in generating meaningful insights. However, missing data in the datasets being analyzed can pose challenges. Tensor factorization can be performed with some level of missing data and still reconstruct a complete tensor, thereby imputing the missing values, but the choice of fitting algorithm may influence the fidelity of these imputations. Previous approaches, based on alternating least squares with prefilled values or on direct optimization, suffer from introduced bias or slow computational performance. In this study, we propose that censored least squares can better handle missing values in data structured in tensor form. We ran censored least squares on four different biological datasets and compared its performance against alternating least squares with prefilled values and direct optimization. We used the error of imputation and the ability to infer masked values to benchmark each method's performance on missing data. Censored least squares appeared best suited for the analysis of high-dimensional biological data by accuracy and convergence metrics across several studies.
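
To illustrate the contrast the abstract draws between prefilled values and censoring, the following is a minimal sketch of a single least-squares solve, the kind of subproblem that alternating least squares repeats for each factor matrix. It is not the authors' implementation: the function name `censored_lstsq`, the synthetic noise-free data, and the fixed missingness pattern are all assumptions made for illustration, and zero-filling stands in for whatever prefill strategy a given pipeline might use.

```python
import numpy as np


def censored_lstsq(A, B, mask):
    """Solve min_X ||mask * (A @ X - B)||_F one column of B at a time,
    using only the rows where mask is True (hypothetical helper)."""
    X = np.zeros((A.shape[1], B.shape[1]))
    for j in range(B.shape[1]):
        obs = mask[:, j]  # rows of column j that were actually observed
        # Ordinary least squares restricted to the observed rows.
        X[:, j] = np.linalg.lstsq(A[obs], B[obs, j], rcond=None)[0]
    return X


# Synthetic, noise-free problem: B = A @ X_true (assumed data, not from the paper).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
X_true = rng.standard_normal((3, 5))
B = A @ X_true

# Deterministic missingness pattern: a quarter of the entries are hidden.
rows, cols = np.indices(B.shape)
mask = (rows + cols) % 4 != 0
B_missing = np.where(mask, B, np.nan)

# Censored solve: missing entries are dropped from each column's fit.
X_cens = censored_lstsq(A, B_missing, mask)

# Prefilled solve: missing entries are replaced with zeros, then fit as usual.
X_fill = np.linalg.lstsq(A, np.where(mask, B, 0.0), rcond=None)[0]

print("censored error:", np.linalg.norm(X_cens - X_true))
print("zero-filled error:", np.linalg.norm(X_fill - X_true))
```

In this toy setting the censored solve recovers the true coefficients up to numerical precision because the missing entries never enter the fit, while the zero-filled solve is pulled toward the fill values. In a PARAFAC setting, `A` would plausibly play the role of the Khatri-Rao product of the other factor matrices and `B` the unfolded tensor, though that mapping is stated here only as an assumption.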

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e0b/11257416/5883df766d26/nihpp-2024.07.05.602272v1-f0001.jpg