Guven Emine
Department of Biomedical Engineering, Düzce University, Düzce, Turkey.
JMIR Bioinform Biotechnol. 2023 Jun 6;4:e43665. doi: 10.2196/43665.
There is a pressing need for computational approaches that analyze and exploit the information contained in gene expression data. Recent applications of nonnegative matrix factorization (NMF) in computational biology have shown that it can extract essential information from large amounts of data, particularly gene expression microarrays. A common problem in NMF is choosing the proper factorization rank (r), that is, the number of factors in the reduced representation, and there is no agreement on which technique is most appropriate for this purpose. Consequently, various techniques have been suggested for selecting the optimal value of the factorization rank (r).
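For readers unfamiliar with the notation, the rank selection problem can be written out as follows. This is the standard textbook formulation of NMF rather than anything specific to this study; the symbols V, W, H, n, and m are assumptions introduced here for illustration, and only r appears in the abstract.

```latex
% V: nonnegative target matrix (e.g., genes x samples); W, H: nonnegative factors
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{n \times m}, \quad
W \in \mathbb{R}_{\ge 0}^{n \times r}, \quad
H \in \mathbb{R}_{\ge 0}^{r \times m}, \quad
r < \min(n, m)

% Residual sum of squares of a rank-r fit (squared Frobenius norm of the error)
\mathrm{RSS}(r) = \lVert V - W H \rVert_F^2
               = \sum_{i=1}^{n} \sum_{j=1}^{m} \bigl( V_{ij} - (W H)_{ij} \bigr)^2
```

A larger r generally lowers the residual but yields a less compact representation, which is why a principled criterion for choosing r is needed.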
In this work, a new metric for rank selection based on the elbow method is proposed and systematically compared against the cophenetic metric.
To determine the optimal factorization rank (r), this study applied the unit invariant knee (UIK) method to NMF of gene expression data sets. The UIK method relies on the extremum distance estimator, which is used for inflection point estimation and, from it, identification of a knee point. With a gene expression data set as the target matrix, the proposed method therefore uses UIK to find the first inflection point of the residual sum of squares curve produced by the NMF algorithms under consideration.
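The idea of reading a rank off the knee of a residual curve can be illustrated with a small sketch. Note that this is not the UIK method or the extremum distance estimator used in the study: it substitutes a simpler, commonly used chord-distance heuristic (the point farthest from the straight line joining the two ends of the curve). The function name `chord_knee` and the residual values in the usage note are hypothetical.

```python
from math import hypot


def chord_knee(ranks, rss):
    """Return the rank whose (rank, RSS) point lies farthest from the chord
    joining the first and last points of the curve.

    Simplified stand-in for a knee detector; NOT the UIK/EDE algorithm.
    """
    if len(ranks) != len(rss) or len(ranks) < 3:
        raise ValueError("need at least three (rank, rss) pairs of equal length")

    x0, x1 = ranks[0], ranks[-1]
    y_min, y_max = min(rss), max(rss)
    if x0 == x1 or y_min == y_max:
        raise ValueError("curve is degenerate (no spread in ranks or rss)")

    # Rescale both axes to [0, 1] so the result does not depend on units.
    xs = [(x - x0) / (x1 - x0) for x in ranks]
    ys = [(y - y_min) / (y_max - y_min) for y in rss]

    # Perpendicular distance of every point from the end-to-end chord.
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = hypot(dx, dy)
    dists = [abs(dy * (x - xs[0]) - dx * (y - ys[0])) / norm
             for x, y in zip(xs, ys)]

    # The knee estimate is the rank at the largest distance.
    best = max(range(len(dists)), key=dists.__getitem__)
    return ranks[best]
```

With made-up residuals that drop steeply and then flatten, for example `chord_knee([2, 3, 4, 5, 6, 7, 8], [100, 60, 30, 25, 22, 20, 19])`, the sketch returns 4. Rescaling the axes first mirrors the unit-invariance goal in spirit only; the actual UIK procedure should be taken from its reference implementation.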
The UIK procedure was run on gene expression data from acute lymphoblastic leukemia and acute myeloid leukemia samples, and the NMF results obtained with different algorithms were compared. The proposed UIK method is easy to perform, fast, requires no a priori rank value as input, and does not depend on initial parameters that significantly influence the model's behavior.
This study demonstrates that the elbow method provides a credible rank prediction for gene expression data and a precise estimate for simulated mutational processes data with known dimensions. The proposed UIK method is faster, and thus more computationally efficient, than conventional methods, including metrics that use the consensus matrix as a criterion for rank selection, and it requires no visual inspection of the curves. Finally, the suggested elbow-based rank tuning method for gene expression data is arguably theoretically superior to the cophenetic measure.