Lavezzo Guilherme Miura, Lauretto Marcelo de Souza, Andrioli Luiz Paulo Moura, Machado-Lima Ariane
Universidade de São Paulo, Instituto de Matemática e Estatística, Programa Interunidades de Pós-Graduação em Bioinformática, São Paulo, SP, Brazil.
Universidade de São Paulo, Escola de Artes, Ciências e Humanidades, São Paulo, SP, Brazil.
Genet Mol Biol. 2024 Jan 19;46(4):e20230048. doi: 10.1590/1678-4685-GMB-2023-0048. eCollection 2024.
Prediction of transcription factor binding sites (TFBSs) is a Bioinformatics application in which DNA molecules are represented as sequences of the symbols A, C, G and T. The most widely used model for this problem is the Position Weight Matrix (PWM). Although PWMs have the advantage of simplicity, they cannot capture dependencies between nucleotide positions, which may impair prediction performance. The Acyclic Probabilistic Finite Automaton (APFA) is an alternative model that can accommodate position dependencies. However, an APFA is more complex, which means that more parameters have to be learned. In this paper, we propose a method to identify when position dependencies favor PWMs or APFAs. To this end, we extracted position dependency features from 1106 sets of TFBSs and used them to infer a decision tree that predicts which model - PWM or APFA - is best for a given set of TFBSs. According to our results, as few as three selected features suffice to choose the best model, providing a balance between performance (average precision) and model simplicity.
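As a rough illustration of why a PWM cannot see position dependencies, the following minimal Python sketch builds a log-odds PWM from a toy set of aligned sites and compares it with a pairwise dependency measure. The toy sites, the function names, the pseudocount, the uniform background and the use of mutual information are all assumptions made for this example; the abstract does not state which dependency features the authors extracted.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

# Hypothetical toy set of aligned binding sites (not data from the paper).
# Positions 0 and 1 are fully dependent: A is always followed by C, T by G.
SITES = ["ACGT", "ACGA", "TGGT", "TGGA"]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM: one dict of nucleotide weights per position."""
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        pwm.append({
            nt: math.log2(((counts[nt] + pseudocount) / total) / background)
            for nt in ALPHABET
        })
    return pwm


def score(pwm, seq):
    """Sum the per-position weights; each position is scored independently."""
    return sum(column[nt] for column, nt in zip(pwm, seq))


def mutual_information(sites, i, j):
    """Mutual information (in bits) between positions i and j of the sites."""
    n = len(sites)
    p_i = Counter(site[i] for site in sites)
    p_j = Counter(site[j] for site in sites)
    p_ij = Counter((site[i], site[j]) for site in sites)
    return sum(
        (c / n) * math.log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )


pwm = build_pwm(SITES)
print("score(ACGT):", score(pwm, "ACGT"))  # pair A,C is observed in SITES
print("score(AGGT):", score(pwm, "AGGT"))  # pair A,G is never observed
print("MI(0, 1):", mutual_information(SITES, 0, 1))
print("MI(0, 3):", mutual_information(SITES, 0, 3))
```

In this toy set the PWM assigns the same score to ACGT and AGGT, even though the combination A followed by G never occurs, while the mutual information between positions 0 and 1 is 1 bit and between positions 0 and 3 is 0. A model that conditions on earlier positions, such as an APFA, could in principle tell the two sequences apart, at the cost of more parameters to learn.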