Département d'Informatique, UQAM, Montréal, QC H2X 3Y7, Canada.
Centre for Structural and Functional Genomics, Concordia University, Montréal, QC H4B 1R6, Canada.
Bioinformatics. 2022 Aug 10;38(16):3984-3991. doi: 10.1093/bioinformatics/btac420.
Precise identification of Biosynthetic Gene Clusters (BGCs) is a challenging task. The performance of BGC discovery tools is limited by their capacity to accurately predict the components belonging to candidate BGCs, and they often overestimate cluster boundaries. To support optimizing the composition and boundaries of candidate BGCs, we propose a reinforcement learning approach relying on protein domains and functional annotations from expert-curated BGCs.
The proposed reinforcement learning method aims to improve candidate BGCs obtained with state-of-the-art tools. It was evaluated on candidate BGCs obtained for two fungal genomes, Aspergillus niger and Aspergillus nidulans. The results show that gene precision improved by more than 15% for TOUCAN, fungiSMASH and DeepBGC, and cluster precision by more than 25% for fungiSMASH and DeepBGC, allowing these tools to reach almost perfect precision in cluster prediction. This can pave the way for optimizing current predictions of candidate BGCs in fungi, while minimizing the curation effort required from domain experts.
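As an illustration of why trimming overestimated cluster boundaries raises gene precision, the following minimal sketch computes precision at the gene level. It assumes gene precision means the fraction of genes in a candidate BGC that also appear in the curated BGC; the abstract does not give the exact definition, and the function name and gene identifiers are hypothetical.

```python
def gene_precision(predicted_genes, curated_genes):
    """Fraction of predicted genes that belong to the curated BGC.

    Assumed definition for illustration only; the paper may compute
    gene precision differently.
    """
    predicted = set(predicted_genes)
    if not predicted:
        return 0.0
    return len(predicted & set(curated_genes)) / len(predicted)


# Hypothetical gene identifiers, not taken from the paper's data.
curated = ["g2", "g3", "g4"]
candidate = ["g1", "g2", "g3", "g4", "g5"]  # boundaries overestimated
trimmed = ["g2", "g3", "g4", "g5"]          # one flanking gene removed

before = gene_precision(candidate, curated)
after = gene_precision(trimmed, curated)
print(f"before trimming: {before:.2f}, after trimming: {after:.2f}")
```

In this toy case, removing a single flanking gene that is not part of the curated cluster increases the assumed precision measure without losing any true cluster gene.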
https://github.com/bioinfoUQAM/RL-bgc-components.
Supplementary data are available at Bioinformatics online.