Australian Institute of Health Innovation, University of New South Wales, Australia.
Bioinformatics. 2011 Mar 15;27(6):791-6. doi: 10.1093/bioinformatics/btr036. Epub 2011 Jan 22.
Larger-than-gene structures (LGS) are DNA segments that include at least one gene and often other features such as inverted repeats and gene promoters. Mobile genetic elements (MGEs) such as integrons are LGS that play an important role in horizontal gene transfer, primarily in Gram-negative organisms. Because of the number of genes involved, known LGS have a profound effect on virulence, antibiotic resistance and other properties of the organism. Expert-compiled grammars have been shown to be an effective computational representation of LGS, well suited to automating annotation and supporting de novo gene discovery. However, development of LGS grammars by experts is labour-intensive and restricted to known LGS.
This study uses computational grammar inference methods to automate LGS discovery. We compare the ability of six algorithms to infer LGS grammars from DNA sequences annotated with genes and other short sequences. We then compare the predictive power of the learned grammars against an expert-developed grammar for gene cassette arrays found in Class 1, 2 and 3 integrons, which are modular LGS containing up to 9 of about 240 cassette types.
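To make the idea of a grammar over annotated sequences concrete, the following is a minimal sketch of a toy cassette-array grammar expressed as a regular expression over annotation tokens. It only illustrates what it means for a grammar to accept an annotated sequence; it does not implement any of the six inference algorithms and is not the expert-developed grammar. The token names (`attI`, `cassette:<name>`, `attC`), the function `matches_array`, and the requirement of at least one cassette are assumptions made for illustration.

```python
import re

# Hypothetical token vocabulary: the study's actual annotation labels
# are not given in this abstract.
MAX_CASSETTES = 9  # the abstract states arrays contain up to 9 cassettes

# Toy grammar (an assumption, not the expert grammar): an attI site
# followed by one to nine (cassette, attC) pairs.
ARRAY_PATTERN = re.compile(r"attI( cassette:\w+ attC){1,%d}" % MAX_CASSETTES)


def matches_array(tokens):
    """Return True if the annotated token sequence fits the toy grammar."""
    return ARRAY_PATTERN.fullmatch(" ".join(tokens)) is not None
```

A sequence annotated as `["attI", "cassette:dfrA17", "attC", "cassette:aadA1", "attC"]` would be accepted by this toy grammar, while one with a missing cassette or more than nine cassettes would be rejected.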
Using a Bayesian generalization algorithm, our inferred grammar was able to predict >95% of MGE structures in a corpus of 1760 sequences obtained from GenBank (F-score 75%). Even with 100% noise added to the training and test sets, we obtained an F-score of 68%, indicating that the method is robust and has the potential to predict LGS de novo when the underlying gene features are known.
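For readers unfamiliar with the metric, the F-score reported here can be sketched as the harmonic mean of precision and recall. The abstract does not state which F-score variant was used, so the balanced F1 form below is an assumption, and the function name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumes the balanced variant; the abstract does not specify
    how precision and recall were weighted.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing is predicted correctly
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a grammar that recovers most structures can still receive a noticeably lower F-score if its precision is modest.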