A comparative analysis of ENCODE and Cistrome in the context of TF binding signal.

Affiliations

Lee Kong Chian School of Medicine, Nanyang Technological University, 9 Nanyang Drive, 636921, Singapore, Singapore.

Department of Electronics, Information and Bioengineering, Politecnico di Milano, 32 Piazza Leonardo da Vinci, 20133, Milano, Italy.

Publication information

BMC Genomics. 2024 Aug 30;25(Suppl 3):817. doi: 10.1186/s12864-024-10668-6.

Abstract

BACKGROUND

With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as controls or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and data processing plays a significant role in the quality of the resulting datasets. Two popular repositories for transcription factor binding site data, ENCODE and Cistrome, process the same biological samples in alternative ways, and their results are not always consistent. Moreover, the output format of the processing (BED narrowPeak) exposes a feature, the signalValue, which is seldom used in consistency checks but can offer valuable insight into the quality of the data.

RESULTS

We provide evidence that data points with high signalValue(s) (top 25% of values) are more likely to be consistent between ENCODE and Cistrome in human cell lines K562, GM12878, and HepG2. In addition, we show that filtering according to said high values improves the quality of predictions for a machine learning algorithm that detects transcription factor interactions based only on positional information. Finally, we provide a set of practices and guidelines, based on the signalValue feature, for scientists who wish to compare and merge narrowPeaks from ENCODE and Cistrome.
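As an illustration of the top-25% filter mentioned above, here is a minimal sketch in Python. It assumes tab-separated BED narrowPeak input, where signalValue is the seventh column, and it computes the threshold per file with the "inclusive" quartile method. The abstract does not say how the quartile is computed, so both choices are assumptions, and the function names `parse_narrowpeak` and `top_quartile` are hypothetical.

```python
import statistics


def parse_narrowpeak(lines):
    """Parse BED narrowPeak lines; signalValue is the 7th tab-separated column."""
    peaks = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        peaks.append({
            "chrom": fields[0],
            "start": int(fields[1]),
            "end": int(fields[2]),
            "signal": float(fields[6]),
        })
    return peaks


def top_quartile(peaks):
    """Keep peaks whose signalValue is at or above the third quartile.

    Assumption: the threshold is computed within one file; the paper's
    exact procedure is not described in the abstract.
    """
    if len(peaks) < 2:
        return list(peaks)  # a quartile is not meaningful for 0 or 1 peaks
    signals = [p["signal"] for p in peaks]
    # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
    q3 = statistics.quantiles(signals, n=4, method="inclusive")[2]
    return [p for p in peaks if p["signal"] >= q3]
```

Calling `top_quartile(parse_narrowpeak(open(path)))` on a narrowPeak file would return the subset of peaks to carry forward, where `path` is a placeholder for a local file.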

CONCLUSIONS

The signalValue is an informative feature that can be used effectively to highlight consistent areas of overlap between different sources of TF binding sites that expose it. Its applicability extends downstream to positional machine learning algorithms, making it a powerful tool for performance tweaking and data aggregation.
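One simple way to quantify overlap between two sources is the fraction of peaks in one set that overlap at least one peak in the other. The sketch below is an illustrative assumption, not the consistency measure used in the paper, which the abstract does not define. It takes peaks as hypothetical `(chrom, start, end)` tuples in BED half-open coordinates, and the function names are also hypothetical.

```python
from collections import defaultdict


def overlaps_any(peak, other_by_chrom):
    """Return True if the peak overlaps any interval on the same chromosome."""
    chrom, start, end = peak
    # Half-open intervals [start, end) overlap when each starts before the other ends.
    return any(s < end and start < e for s, e in other_by_chrom.get(chrom, []))


def consistent_fraction(peaks_a, peaks_b):
    """Fraction of peaks in peaks_a that overlap at least one peak in peaks_b."""
    if not peaks_a:
        return 0.0
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))
    hits = sum(overlaps_any(p, by_chrom) for p in peaks_a)
    return hits / len(peaks_a)
```

The linear scan per peak keeps the sketch short. For genome-scale files, sorted intervals or an interval tree would be a more practical choice.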
