Department of Computer Science, Faculty of Natural Sciences.
The Shraga Segal Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
Bioinformatics. 2019 Jun 1;35(12):2001-2008. doi: 10.1093/bioinformatics/bty928.
Bacterial infections are a major cause of illness worldwide. However, most bacterial strains pose no threat to human health and may even be beneficial. Thus, developing powerful diagnostic bioinformatic tools that differentiate pathogenic from commensal bacteria is critical for effective treatment of bacterial infections.
We propose a machine-learning approach for classifying human-hosted bacteria as pathogenic or non-pathogenic based on their genome-derived proteomes. Our approach is based on sparse Support Vector Machines (SVM), which autonomously select a small set of genes that are related to bacterial pathogenicity. We implement our approach as a tool, 'Bacterial Pathogenicity Classification via sparse-SVM' (BacPaCS), which is fully automated and handles datasets significantly larger than those previously used. BacPaCS shows high accuracy in distinguishing pathogenic from non-pathogenic bacteria in a clinically relevant dataset comprising only human-hosted bacteria. Among the genes that received the highest positive weight in the resulting classifier, we found genes that are known to be related to bacterial pathogenicity, as well as novel candidates whose involvement in bacterial virulence has not previously been reported.
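To illustrate how a sparse SVM can select features while it classifies, here is a minimal sketch in plain Python. It is not the BacPaCS implementation. It assumes each genome is encoded as a presence/absence vector over protein families, and it trains an L1-regularized hinge-loss classifier with a simple proximal subgradient method. The family names, the toy data and the hyperparameters (`lam`, `lr`, `epochs`) are all hypothetical and chosen only for illustration.

```python
# Toy sparse (L1-regularized) linear SVM; illustrative only, not BacPaCS code.

# Hypothetical protein-family features (1 = present in the genome, 0 = absent).
FAMILIES = ["toxin_like", "commensal_marker", "weak_signal", "housekeeping"]

# Hypothetical toy genomes: four pathogens (+1) followed by four non-pathogens (-1).
X = [
    [1, 0, 1, 1], [1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 0, 1],
    [0, 1, 1, 1], [0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip small values to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def decision(w, b, x):
    """Linear decision score w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_sparse_svm(X, y, lam=0.2, lr=0.1, epochs=200):
    """Minimize mean hinge loss + lam * ||w||_1 by proximal subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Only samples inside the margin contribute to the hinge subgradient.
            if yi * decision(w, b, xi) < 1.0:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        # Gradient step on the hinge loss, then L1 shrinkage on the weights only.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


weights, bias = train_sparse_svm(X, y)

# Features with non-zero weight are the ones the sparse model "selected".
selected = [name for name, wj in zip(FAMILIES, weights) if wj != 0.0]

predictions = [1 if decision(weights, bias, xi) >= 0 else -1 for xi in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)

print("selected families:", selected)
print("training accuracy:", accuracy)
```

In this toy setting the L1 shrinkage step is what keeps weakly informative or uninformative families at a weight of exactly zero. A positive weight marks a family associated with the pathogenic class, which mirrors the way the highest positively weighted genes are read as pathogenicity candidates. A real analysis would use an established solver and held-out evaluation rather than training accuracy.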
The code and the resulting model are available at: https://github.com/barashe/bacpacs.
Supplementary data are available at Bioinformatics online.