Novel AI-powered computational method using tensor decomposition for identification of common optimal bin sizes when integrating multiple Hi-C datasets.

Author information

Taguchi Y-H, Turki Turki

Affiliations

Department of Physics, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo, 112-8551, Japan.

Department of Computer Science, King Abdulaziz University, Jeddah, 21589, Saudi Arabia.

Publication information

Sci Rep. 2025 Mar 3;15(1):7459. doi: 10.1038/s41598-025-91355-8.

Abstract

Identifying the optimal bin size (or resolution) for integrating multiple Hi-C datasets is challenging because the bin size must be common to all datasets, whereas the dependence of quality on bin size can vary from dataset to dataset. Moreover, common structures should not be sought at bin sizes smaller than the optimal one: below it, the common structure is no longer the primary structure, even if the number of mapped short reads per bin is increased. No common structures exist at these finer resolutions, which suggests that individual Hi-C datasets may have to be analyzed separately at bin sizes smaller than the optimal one. Quality assessments of individual datasets therefore have limited ability to determine the best bin size for all datasets. In this study, we propose a novel application of tensor decomposition (TD)-based unsupervised feature extraction (FE) to choose the optimal bin size for the integration of multiple Hi-C datasets. TD-based unsupervised FE exhibits phase transition-like phenomena through which the smallest possible bin size (or the highest resolution) can be estimated automatically and empirically, without manually setting a threshold value. We apply it to the integration of multiple Hi-C datasets retrieved from GEO under GEO IDs GSE260760 and GSE255264. To our knowledge, ours is the first method that can optimize bin sizes over multiple Hi-C profiles without any tunable parameters.
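To make the idea of comparing bin sizes across datasets more concrete, the following is a minimal sketch, not the authors' pipeline. It assumes synthetic contact matrices that share one structure plus dataset-specific noise, stacks them into a bins × bins × datasets tensor at several bin sizes, and computes the singular values of the dataset-mode unfolding, which is one step of a higher-order SVD. The helper names (`rebin`, `mode_singular_values`, `leading_share`) and the "leading share" quantity are hypothetical illustrations; the abstract does not specify the criterion used to detect the phase transition-like behavior.

```python
import numpy as np


def rebin(contact, factor):
    """Coarsen a square contact matrix by summing factor x factor blocks."""
    n = contact.shape[0] // factor * factor  # drop trailing bins that do not fill a block
    m = n // factor
    return contact[:n, :n].reshape(m, factor, m, factor).sum(axis=(1, 3))


def mode_singular_values(tensor, mode):
    """Singular values of the mode-`mode` unfolding of a tensor (one HOSVD step)."""
    unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    return np.linalg.svd(unfolded, compute_uv=False)


def leading_share(tensor, mode=2):
    """Fraction of squared singular values carried by the first component.

    Illustrative proxy only: a high value means one component dominates
    across datasets, i.e. the datasets look alike at this bin size.
    """
    s = mode_singular_values(tensor, mode)
    return float(s[0] ** 2 / np.sum(s ** 2))


# Synthetic stand-in for Hi-C data (assumption): one shared symmetric map
# plus independent noise per dataset.
rng = np.random.default_rng(0)
n_bins, n_datasets = 64, 4
shared = rng.poisson(5.0, size=(n_bins, n_bins)).astype(float)
shared = shared + shared.T
maps = [shared + rng.normal(0.0, 20.0, size=shared.shape) for _ in range(n_datasets)]

# Rebin every dataset with the same factor so the bin size stays common to all.
shares = {}
for factor in (1, 2, 4, 8):
    tensor = np.stack([rebin(m, factor) for m in maps], axis=2)
    shares[factor] = leading_share(tensor)

for factor, share in shares.items():
    print(f"bin factor {factor}: leading share = {share:.3f}")
```

In this toy setting the shared component dominates more strongly at coarser bins, because summing counts averages out dataset-specific noise. Real Hi-C data would need normalization and a principled criterion, which this sketch does not attempt.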

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6181/11876364/89d96cdc474b/41598_2025_91355_Fig1_HTML.jpg
