Division of Biomedical Informatics, Cincinnati Children's Hospital Research Foundation, 3333 Burnet Avenue, Cincinnati, OH 45229-3026, USA.
Department of Management Science & Information Systems, Rutgers University, 252 Janice H. Levin Hall, Piscataway, NJ 08854, USA.
Biomolecules. 2011 Dec 27;2(1):1-22. doi: 10.3390/biom2010001.
Accurately predicting essential genes is important in many areas of biology, medicine and bioengineering. In previous research, we developed a machine-learning-based integrative algorithm to predict essential genes in bacterial species. This algorithm supports two approaches to predicting essential genes: learning the traits of known essential genes in the target organism, or transferring essential gene annotations from a closely related model organism. For an understudied microbe, however, each approach has potential limitations. The first is constrained by the often small number of known essential genes. The second is limited by the availability of model organisms and by evolutionary distance. In this study, we aim to determine the optimal strategy for predicting essential genes by examining four microbes with well-characterized essential genes. Our results suggest that, unless the known essential genes are few, learning from the known essential genes in the target organism usually outperforms transferring essential gene annotations from a related model organism. In fact, surprisingly few known essential genes are needed to make accurate predictions. In prokaryotes, when the known essential genes exceed 2% of total genes, this approach already comes close to its optimal performance. In eukaryotes, reaching the same near-optimal performance requires known essential genes amounting to more than 4% of total genes, reflecting the greater complexity of eukaryotic organisms. Combining the two approaches improved performance when the known essential genes are few. Our investigation thus provides key information on accurately predicting essential genes and will greatly facilitate the annotation of microbial genomes.
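As a rough illustration of the guidance above, the reported thresholds can be read as a simple rule of thumb for choosing a strategy. The sketch below is not the authors' algorithm: the function name, the strategy labels, and the hard cutoff are hypothetical, and only the 2% and 4% figures come from the abstract.

```python
# Illustrative rule of thumb only; not the authors' prediction algorithm.
# The 2% and 4% figures are the fractions reported in the abstract.
# The function name and strategy labels are hypothetical.
THRESHOLDS = {"prokaryote": 0.02, "eukaryote": 0.04}


def suggest_strategy(known_essential: int, total_genes: int, domain: str) -> str:
    """Suggest a prediction strategy from the fraction of known essential genes."""
    if domain not in THRESHOLDS:
        raise ValueError(f"unknown domain: {domain!r}")
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")

    fraction = known_essential / total_genes
    if fraction > THRESHOLDS[domain]:
        # Enough known essential genes: learning from the target organism
        # is reported to come close to its optimal performance.
        return "learn_from_target"
    # Few known essential genes: combining both approaches is reported
    # to improve performance in this regime.
    return "combine_target_and_transfer"


if __name__ == "__main__":
    # 100 known essential genes out of 4000 is 2.5% of the genome.
    print(suggest_strategy(100, 4000, "prokaryote"))
```

Treating the thresholds as hard cutoffs is a simplification made here for illustration; the abstract describes them as the points at which learning from the target organism approaches its best performance.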