Coelho Rafael Vieira, de Avila E Silva Scheila, Echeverrigaray Sergio, Delamare Ana Paula Longaray
Rio Grande do Sul Federal Institute of Education, Science and Technology (IFRS), Farroupilha Campus, Farroupilha, RS, Brazil.
Biotechnology Institute, University of Caxias do Sul (UCS), Caxias do Sul, RS, Brazil.
Data Brief. 2018 May 13;19:264-270. doi: 10.1016/j.dib.2018.05.025. eCollection 2018 Aug.
This paper presents the prediction of promoters using a Support Vector Machine (SVM) system. Far less information is available in the literature on the promoter sequences of Gram-positive bacteria than on those of Gram-negative bacteria. Promoter sequence identification is essential for studying gene expression. We first collected the genome sequence from the NCBI database and identified promoters by their sigma factors in the DBTBS database. We then grouped the promoters according to 15 factors in 2 domains, corresponding to sigma 54 and sigma 70 of Gram-negative bacteria. Based on these data, we developed a Python script to search for promoters in the genome. After processing the data, we obtained 767 promoter sequences for Bacillus subtilis, most of which were recognized by the sigma factor SigA. To validate these data, we developed a software package called BacSVM+, which receives promoters as input and returns the best combination of LibSVM parameters for predicting promoter regions in the bacteria used in the simulation. All data gathered, as well as the BacSVM+ software, are available for download at http://bacpp.bioinfoucs.com/rafael/Sigmas.zip.
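The two processing steps described in the abstract, locating known promoters in a genome and handing them to LibSVM, might be sketched roughly as follows. This is a minimal illustration and not the authors' script or the BacSVM+ code: the function names `reverse_complement`, `find_promoters`, and `to_libsvm_line` are hypothetical, exact string matching on both strands is an assumption, and the one-hot nucleotide encoding is only one plausible way to turn a sequence into features. The only detail taken from LibSVM itself is its sparse text format, `label index:value ...`, with 1-based ascending indices.

```python
from typing import Dict, List, Tuple

# Watson-Crick complements, used to scan the reverse strand.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Assumed one-hot layout: four features per position, in the order A, C, G, T.
_ONE_HOT = {"A": 0, "C": 1, "G": 2, "T": 3}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_promoters(genome: str, promoters: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Locate every exact occurrence of each promoter on both strands.

    Returns (name, 0-based start on the forward strand, strand) tuples,
    sorted by start position.
    """
    genome = genome.upper()
    hits = []
    for name, seq in promoters.items():
        seq = seq.upper()
        for strand, pattern in (("+", seq), ("-", reverse_complement(seq))):
            if strand == "-" and pattern == seq:
                continue  # palindromic site: already reported on "+"
            start = genome.find(pattern)
            while start != -1:
                hits.append((name, start, strand))
                start = genome.find(pattern, start + 1)  # keep overlapping hits
    return sorted(hits, key=lambda hit: hit[1])


def to_libsvm_line(label: int, seq: str) -> str:
    """Encode one sequence as a line of LibSVM's sparse text format.

    Bases other than A, C, G, T (for example N) contribute no feature.
    """
    features = []
    for pos, base in enumerate(seq.upper()):
        if base in _ONE_HOT:
            features.append(f"{4 * pos + _ONE_HOT[base] + 1}:1")
    return f"{label:+d} " + " ".join(features)
```

For example, `find_promoters("TTGACAAAATGTCAA", {"p1": "TTGACA"})` reports one hit on each strand, and `to_libsvm_line(1, "ACGT")` produces a positive training example. A real pipeline would more likely tolerate mismatches and use annotated coordinates rather than exact matching.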