Iterative bicluster-based Bayesian principal component analysis and least squares for missing-value imputation in microarray and RNA-sequencing data.

Affiliation

Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia.

Publication information

Math Biosci Eng. 2022 Jun 16;19(9):8741-8759. doi: 10.3934/mbe.2022405.

Abstract

Microarray and RNA-sequencing (RNA-seq) techniques each produce gene expression data that can be represented as a matrix, and this matrix often contains missing values. A missing-value imputation process that uses the coherence information of the dataset is therefore necessary. Existing imputation methods, such as iterative bicluster-based least squares (bi-iLS), use biclustering to estimate the missing values because genes are similar only under correlated experimental conditions. They also use the row average to obtain a temporary complete matrix, which is considered a flaw: the row average uses only the information in an individual row and so cannot reflect the real structure of the dataset. We therefore propose using Bayesian principal component analysis (BPCA), instead of the row average, to obtain the temporary complete matrix in bi-iLS. This change yields a new missing-value imputation method called iterative bicluster-based Bayesian principal component analysis and least squares (bi-BPCA-iLS). Several experiments were conducted on two-dimensional, independent gene expression datasets: microarray datasets (e.g., the cell-cycle expression dataset of the yeast Saccharomyces cerevisiae) and RNA-seq datasets (gene expression data from Schizosaccharomyces pombe). On the microarray dataset, the proposed bi-BPCA-iLS method showed a significant overall improvement in normalized root mean square error (NRMSE) of 10.6% relative to local least squares (LLS) and 0.6% relative to bi-iLS. On the RNA-seq dataset, bi-BPCA-iLS showed an overall improvement in NRMSE of 8.2% relative to LLS and 3.1% relative to bi-iLS. The additional computational time of bi-BPCA-iLS compared with bi-iLS is not significant.
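To make the row-average step and the NRMSE metric concrete, here is a minimal Python sketch. It is not the authors' implementation. The function names `row_average_fill` and `nrmse` are hypothetical, missing entries are assumed to be encoded as `None`, and NRMSE is assumed to be the RMSE over the imputed entries divided by the standard deviation of the true values, a common convention that the abstract itself does not spell out.

```python
import math
from statistics import fmean, pstdev


def row_average_fill(matrix):
    """Fill each missing entry (None) with the mean of the observed values in its own row.

    This is the row-average baseline used to build a temporary complete
    matrix; note that it never looks at any other row.
    """
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            raise ValueError("a row with no observed values cannot be row-averaged")
        row_mean = fmean(observed)
        filled.append([row_mean if v is None else v for v in row])
    return filled


def nrmse(imputed, truth):
    """RMSE of the imputed values divided by the standard deviation of the true values.

    Assumed normalization (hypothetical for this paper): population
    standard deviation of the held-out true values.
    """
    if len(imputed) != len(truth) or len(truth) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    squared_errors = [(g - t) ** 2 for g, t in zip(imputed, truth)]
    return math.sqrt(fmean(squared_errors)) / pstdev(truth)
```

For instance, hiding entries of a small complete matrix, filling them with `row_average_fill`, and passing the filled and true values at the hidden positions to `nrmse` gives the kind of score the abstract compares across LLS, bi-iLS, and bi-BPCA-iLS. Because the fill depends only on a single row, any structure shared across genes or conditions is ignored, which is the limitation that motivates replacing it with BPCA.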
