Novel AI-powered computational method using tensor decomposition for identification of common optimal bin sizes when integrating multiple Hi-C datasets.

Authors

Taguchi Y-H, Turki Turki

Affiliations

Department of Physics, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo, 112-8551, Japan.

Department of Computer Science, King Abdulaziz University, Jeddah, 21589, Saudi Arabia.

Publication information

Sci Rep. 2025 Mar 3;15(1):7459. doi: 10.1038/s41598-025-91355-8.

DOI:10.1038/s41598-025-91355-8

PMID:40033014

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11876364/

Abstract

Identifying the optimal bin sizes (or resolutions) for the integration of multiple Hi-C datasets is challenging because the bin sizes must be common to all datasets, whereas the dependence of quality on bin size can vary from dataset to dataset. Moreover, common structures should not be sought at bin sizes smaller than the optimal one: below it, the common structure is no longer the primary structure, even after increasing the number of mapped short reads per bin. In this case, there are no common structures at finer resolutions, suggesting that individual Hi-C datasets may have to be analyzed separately at bin sizes smaller than the optimal one. Thus, quality assessments of individual datasets have a limited ability to determine the best bin size for all datasets. In this study, we propose a novel application of tensor decomposition (TD) based unsupervised feature extraction (FE) to choose the optimal bin sizes for the integration of multiple Hi-C datasets. TD-based unsupervised FE exhibits phase transition-like phenomena through which the smallest possible bin size (or the highest resolution) for the integration of multiple Hi-C datasets can be estimated automatically and empirically, without the need to manually set a threshold value. The Hi-C datasets were retrieved from GEO under the GEO IDs GSE260760 and GSE255264. To our knowledge, this is the first method that can optimize bin sizes over multiple Hi-C profiles without any tunable parameters.
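
The idea that a structure shared by all datasets becomes dominant only at sufficiently coarse bins can be illustrated with a toy example. The sketch below is not the authors' procedure: it assumes synthetic contact matrices (a shared two-block pattern multiplied by dataset-specific fine-scale variation, with Poisson counts), and it uses a plain SVD of the dataset-mode unfolding of the stacked tensor as a stand-in for the tensor decomposition. All function names and parameter values (`rebin`, `leading_dataset_share`, `make_toy_datasets`, the 0.8 modulation strength) are hypothetical choices made for illustration.

```python
import numpy as np

def rebin(matrix, factor):
    """Sum a square contact matrix into bins `factor` times larger."""
    n = (matrix.shape[0] // factor) * factor
    trimmed = matrix[:n, :n]
    return trimmed.reshape(n // factor, factor, n // factor, factor).sum(axis=(1, 3))

def leading_dataset_share(matrices):
    """Share of squared singular values carried by the first dataset-mode component."""
    tensor = np.stack(matrices, axis=-1).astype(float)  # bins x bins x datasets
    unfolded = tensor.reshape(-1, tensor.shape[-1]).T    # datasets x (bins * bins)
    singular_values = np.linalg.svd(unfolded, compute_uv=False)
    return float(singular_values[0] ** 2 / np.sum(singular_values ** 2))

def make_toy_datasets(n_bins=64, n_datasets=4, seed=0):
    """Synthetic data (an assumption): shared blocks times private fine-scale variation."""
    rng = np.random.default_rng(seed)
    block = np.kron(np.eye(2), np.ones((n_bins // 2, n_bins // 2)))
    shared = 0.2 + block  # expected contacts common to every dataset
    datasets = []
    for _ in range(n_datasets):
        # Dataset-specific multiplicative pattern that averages out in larger bins.
        private = 1.0 + 0.8 * rng.uniform(-1.0, 1.0, size=shared.shape)
        datasets.append(rng.poisson(shared * private))
    return datasets

toy = make_toy_datasets()
shares = {f: leading_dataset_share([rebin(m, f) for m in toy]) for f in (1, 2, 4, 8)}
for factor, share in shares.items():
    print(f"bin factor {factor}: leading shared component explains {share:.3f}")
```

In this toy setting the leading component accounts for a larger share at coarser bins, because the dataset-specific variation averages out when counts are pooled. With real Hi-C data, the matrices would first have to be loaded at a common set of bin sizes, and the criterion used in the paper may differ from the simple share computed here.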
