Department of Computer Science, University of Texas at San Antonio, San Antonio, TX, USA.
Bioinformatics. 2017 Jul 15;33(14):2097-2105. doi: 10.1093/bioinformatics/btx115.
The study of transcriptional regulation remains difficult yet is fundamental to molecular biology research. While the development of both in vivo and in vitro profiling techniques has significantly enhanced our knowledge of transcription factor (TF)-DNA interactions, computational models of TF-DNA interactions are relatively simple and may not reveal sufficient biological insight. In particular, supervised learning based models for TF-DNA interactions attempt to map sequence-level features (k-mers) to binding events but usually ignore the location of k-mers, which can cause data fragmentation and consequently inferior model performance.
Here, we propose a novel algorithm based on the multiple-instance learning (MIL) paradigm. MIL breaks each DNA sequence into multiple overlapping subsequences and models each subsequence separately, thereby implicitly taking binding site locations into consideration, which results in both higher accuracy and better interpretability of the models. Results from both in vivo and in vitro TF-DNA interaction data show that our approach significantly outperforms conventional single-instance learning based algorithms. Importantly, the models learned from in vitro data using our approach can predict in vivo binding with very good accuracy. In addition, the location information obtained by our method provides additional insight into motif finding results from ChIP-Seq data. Finally, our approach can be easily combined with other state-of-the-art TF-DNA interaction modeling methods.
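As an illustration of the bag-of-instances idea described above, the following minimal Python sketch splits a sequence into overlapping windows, scores each window by its k-mer content, and takes the best window as the score of the whole sequence. It is not the authors' implementation: the function names, the window and step sizes, the hand-set k-mer weights, and the use of the maximum instance score as the sequence-level score (a common MIL assumption) are all hypothetical choices made for this example. In practice the weights would be learned from labeled binding data.

```python
def kmer_counts(seq, k=3):
    """Count how often each k-mer occurs in seq."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def make_bag(seq, window=8, step=2):
    """Split seq into overlapping windows.

    Each (start_offset, subsequence) pair is one instance; the list of
    instances is the "bag" that represents the whole sequence.
    """
    last_start = max(len(seq) - window, 0)
    return [(start, seq[start:start + window])
            for start in range(0, last_start + 1, step)]


def score_instance(subseq, weights, k=3):
    """Linear score of one instance from its k-mer counts.

    `weights` maps k-mers to hypothetical weights (assumed, not learned here).
    """
    return sum(weights.get(kmer, 0.0) * n
               for kmer, n in kmer_counts(subseq, k).items())


def score_bag(seq, weights, window=8, step=2, k=3):
    """Score a sequence as the maximum over its instance scores.

    Assumption: max-aggregation, i.e. a sequence is considered bound if at
    least one window looks bound. Returns (best_score, best_start), so the
    offset of the best window hints at where the binding site lies.
    """
    scored = [(score_instance(sub, weights, k), start)
              for start, sub in make_bag(seq, window, step)]
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    toy_weights = {"GAT": 1.0, "ATA": 1.0}   # hypothetical motif k-mers
    toy_seq = "CCCCCCCCGATACCCC"              # toy sequence containing GATA
    print(score_bag(toy_seq, toy_weights))
```

Because every instance carries its start offset, the highest-scoring window also indicates an approximate binding site location, which is the kind of positional information a single k-mer count vector over the whole sequence would discard.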
http://www.cs.utsa.edu/~jruan/MIL/.
Supplementary data are available at Bioinformatics online.