Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data.

Publication information

BMC Bioinformatics. 2013;14 Suppl 10(Suppl 10):S2. doi: 10.1186/1471-2105-14-S10-S2. Epub 2013 Aug 12.

Abstract

Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are the best, how the models should be built and what they can be applied to. In this study, a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to high-throughput SELEX data, and the question of how to choose the most informative k-mers for the binding model is studied. We implemented the standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles discriminate much better between bound and unbound sequences: model prediction accuracies increased with the cycle number for all proteins. We compared the prediction performance of k-mer and position-specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, the performance of the k-mer model was on average 9 percentage points better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for the k-mer and PWM models, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data, demonstrating substantial improvements for some proteins. For protein GATA1 the model can distinguish between true ChIP-seq peaks and negative peaks. For proteins RFX3 and NFATC1 the performance of the model was no better than chance.
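To make the idea of a linear k-mer model concrete, the following is a minimal sketch, not the authors' implementation. It scores a sequence as a bias plus the summed weights of the k-mers it contains, and then keeps only the highest-magnitude k-mers. The abstract does not specify the fitting procedure or the selection rule, so the perceptron update, the magnitude-based pruning, the toy sequences, and all function names here are assumptions chosen for illustration only.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def score(weights, bias, seq, k):
    """Linear score: bias plus (weight * count) summed over the k-mers in seq."""
    counts = kmer_counts(seq, k)
    return bias + sum(weights.get(kmer, 0.0) * n for kmer, n in counts.items())


def train_perceptron(bound, unbound, k, epochs=20):
    """Fit k-mer weights with a plain perceptron (an assumed stand-in fitter)."""
    weights, bias = {}, 0.0
    data = [(s, 1) for s in bound] + [(s, -1) for s in unbound]
    for _ in range(epochs):
        for seq, label in data:
            # Update only when the sequence is misclassified or on the boundary.
            if label * score(weights, bias, seq, k) <= 0:
                for kmer, n in kmer_counts(seq, k).items():
                    weights[kmer] = weights.get(kmer, 0.0) + label * n
                bias += label
    return weights, bias


def top_kmers(weights, n):
    """Return the n k-mers with the largest absolute weight."""
    return sorted(weights, key=lambda m: abs(weights[m]), reverse=True)[:n]


# Toy data (hypothetical): "bound" sequences share the GATA core.
K = 4
bound = ["AGATAA", "CGATAG", "TGATAC"]
unbound = ["ACCTGA", "CTTGCA", "TGCCAT"]

weights, bias = train_perceptron(bound, unbound, K)

# Reduced model: keep only the single most informative k-mer.
pruned = {m: weights[m] for m in top_kmers(weights, 1)}
```

On this toy set the reduced model still separates the two classes, which mirrors, in miniature, the observation that many k-mers can be dropped with little loss of accuracy. In a real analysis the number of retained k-mers would be chosen by cross-validation on held-out sequences, as the abstract describes, rather than fixed by hand.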

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/26ab/3750486/beab9df60ee6/1471-2105-14-S10-S2-1.jpg
