Improved quality metrics for association and reproducibility in chromatin accessibility data using mutual information.

Affiliations

Los Alamos National Laboratory, Genomics and Bioanalytics, Los Alamos, NM, USA.

Los Alamos National Laboratory, Climate, Ecosystems, and Environmental Science, Los Alamos, NM, USA.

Publication information

BMC Bioinformatics. 2023 Nov 22;24(1):441. doi: 10.1186/s12859-023-05553-0.

Abstract

BACKGROUND

Correlation metrics are widely used in genomics analysis and are often applied with little regard to their assumptions of normality, homoscedasticity, and independence of values. This is especially true when comparing values between replicated sequencing experiments that probe chromatin accessibility, such as the assay for transposase-accessible chromatin using sequencing (ATAC-seq). Such data can contain many regions across the human genome with little to no sequencing depth and are thus non-normal, with a large portion of zero values. Despite the widespread use of these metrics in the epigenomics field, few studies have evaluated and benchmarked how correlation and association statistics behave across ATAC-seq experiments with known differences, or how they respond to removing specific outliers from the data. Here, we developed a computational simulation of ATAC-seq data to elucidate the behavior of correlation statistics and to compare their accuracy under set conditions of reproducibility.

RESULTS

Using these simulations, we monitored the behavior of several correlation statistics, including Pearson's R and Spearman's $\rho$ coefficients as well as Kendall's $\tau$ and Top-Down correlation. We also tested the behavior of association measures, including the coefficient of determination $R^2$, Kendall's W, and normalized mutual information. Our experiments reveal an insensitivity of most statistics, including Spearman's $\rho$, Kendall's $\tau$, and Kendall's W, to increasing differences between simulated ATAC-seq replicates. The removal of co-zeros (regions lacking mapped sequenced reads) between simulated experiments greatly improves the estimates of correlation and association. After removing co-zeros, the $R^2$ coefficient and normalized mutual information display the best performance, having a closer one-to-one relationship with the known portion of shared, enhanced loci between simulated replicates. When comparing values between experimental ATAC-seq data using a random forest model, mutual information best predicts ATAC-seq replicate relationships.
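
To make the co-zero idea concrete, the following is a minimal sketch in standard-library Python, not the authors' pipeline. It drops regions where both replicates are zero and then computes Pearson's R, Spearman's rho, and a normalized mutual information. The toy count vectors, the equal-width binning, the bin count, and the normalization by the geometric mean of the two entropies are all assumptions made for illustration; the abstract does not specify how the study estimates or normalizes mutual information.

```python
import math
from collections import Counter

def drop_co_zeros(x, y):
    """Keep only regions where at least one replicate has a nonzero value."""
    pairs = [(a, b) for a, b in zip(x, y) if a != 0 or b != 0]
    return [a for a, _ in pairs], [b for _, b in pairs]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as Pearson's R computed on tie-averaged ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def equal_width_bins(values, bins):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in values]

def normalized_mutual_information(x, y, bins=4):
    """Assumed definition: I(X;Y) / sqrt(H(X) * H(Y)) on equal-width bins."""
    bx, by = equal_width_bins(x, bins), equal_width_bins(y, bins)
    n = len(x)
    hx = entropy(Counter(bx).values(), n)
    hy = entropy(Counter(by).values(), n)
    hxy = entropy(Counter(zip(bx, by)).values(), n)
    if hx == 0 or hy == 0:
        return 0.0
    return (hx + hy - hxy) / math.sqrt(hx * hy)

# Hypothetical read counts per genomic region for two replicates.
rep1 = [0, 0, 0, 0, 0, 0, 12, 3, 0, 7, 25, 1]
rep2 = [0, 0, 0, 0, 0, 0, 2, 9, 5, 0, 20, 4]
f1, f2 = drop_co_zeros(rep1, rep2)

for label, (a, b) in {"all regions": (rep1, rep2), "co-zeros removed": (f1, f2)}.items():
    print(label, round(pearson_r(a, b), 3), round(spearman_rho(a, b), 3),
          round(normalized_mutual_information(a, b), 3))
```

In this toy case the shared block of zeros raises Pearson's R relative to the filtered data, which is the kind of inflation that co-zero removal is meant to avoid. The size and direction of the effect on real ATAC-seq data depend on the data and on the binning chosen, so the numbers here should not be read as representative of the study's results.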

CONCLUSIONS

Collectively, this study demonstrates how measures of correlation and association can behave in epigenomics experiments. We provide improved strategies for quantifying relationships in these increasingly prevalent and important chromatin accessibility assays.
