Programa de Pós-Graduação em Biotecnologia, Universidade de Caxias do Sul, Av. Francisco Getúlio Vargas, 1130, Caxias do Sul, RS, CEP 95070-560, Brazil.
Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica de Yucatán, Yucatán, Mérida, Mexico.
BMC Bioinformatics. 2022 May 10;23(1):171. doi: 10.1186/s12859-022-04714-x.
Archaea remain a vast and largely unexplored domain. Bioinformatic techniques may pave the way to higher-quality genome annotation across varied organisms. Archaeal promoter sequences are acted upon by a plethora of proteins. The structural conservation of the binding sites of proteins such as TBP, TFB, and TFE aids RNAP-DNA stabilization and makes archaeal promoters amenable to exploration by statistical and machine learning techniques.
In this study, experimentally verified promoter sequences from Haloferax volcanii, Sulfolobus solfataricus, and Thermococcus kodakarensis were converted into DNA duplex stability attributes (i.e., numerical variables) and classified with Artificial Neural Networks and an in-house statistical classification method, each tested against three forms of control. Recognizing these promoters made it possible to use the models to validate unannotated promoter sequences in other organisms. As a result, the binding sites of basal transcription factors were located through the DNA duplex stability encoding. Additionally, the classification gave satisfactory results (above 90%) across the varied levels of control.
The classification models were then used to annotate the genomes of the archaea Aciduliprofundum boonei and Thermofilum pendens; the potential promoters identified have been uploaded to public repositories.
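The conversion of promoter sequences into DNA duplex stability attributes can be illustrated with a minimal sketch. The abstract does not state which stability parameters or encoding the authors used, so everything below is an assumption: the table holds the widely used unified nearest-neighbor free energies (kcal/mol at 37 °C), and the names `NN_DG` and `stability_profile` are hypothetical. Each dinucleotide step in the sequence is mapped to one numerical value, giving a vector that a classifier could take as input.

```python
# Hypothetical sketch: encode a DNA sequence as nearest-neighbor duplex
# stability values. ASSUMPTION: the unified nearest-neighbor free energies
# (kcal/mol, 37 °C) are used here for illustration; the study's actual
# parameter set and encoding are not given in the abstract.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def stability_profile(seq):
    """Return one free-energy value per overlapping dinucleotide step."""
    seq = seq.upper()
    profile = []
    for i in range(len(seq) - 1):
        step = seq[i:i + 2]
        if step not in NN_DG:
            # Ambiguous bases (e.g. N) have no defined stability value.
            raise ValueError(f"unsupported dinucleotide {step!r} at position {i}")
        profile.append(NN_DG[step])
    return profile


if __name__ == "__main__":
    # An AT-rich, TATA-like stretch is less stable (less negative values)
    # than a GC-rich stretch of the same length.
    print(stability_profile("TTTATATA"))
    print(stability_profile("GCGCGGCC"))
```

In this encoding, a sequence of length n yields n - 1 attributes, and AT-rich regions such as a TATA-like element stand out as stretches of less negative values.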