Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel.
The Shraga Segal Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Be'er Sheva, Israel.
BMC Bioinformatics. 2022 Jun 24;23(1):253. doi: 10.1186/s12859-022-04777-w.
The human body is inhabited by a diverse community of commensal non-pathogenic bacteria, many of which are essential for our health. By contrast, pathogenic bacteria can invade their hosts and cause disease. Characterizing the differences between pathogenic and commensal non-pathogenic bacteria is important for the detection of emerging pathogens and for the development of new treatments. Previous methods for classifying bacteria as pathogenic or non-pathogenic used either raw genomic reads or protein families as features. Using protein families instead of reads made the resulting model more interpretable. However, the accuracy of protein-family-based classifiers can still be improved.
We developed a wide-scope pathogenicity classifier (WSPC), a new protein-content-based machine-learning classification model. We trained WSPC on a newly curated dataset of 641 bacterial genomes, each belonging to a different species. Our comparative analysis shows that WSPC outperforms existing models on two benchmark test sets. We observed that the most discriminative protein-family features in WSPC are widely spread among bacterial species. These features correspond to proteins involved in the ability of bacteria to survive and replicate during an infection, rather than proteins directly involved in damaging or invading the host.
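To make the idea of a protein-content-based classifier concrete, the following is a minimal sketch and not the WSPC implementation. The abstract does not state which learning algorithm WSPC uses, so a Bernoulli naive Bayes model is swapped in here purely for illustration. The genome names, the protein-family IDs (`PF_A` to `PF_D`), and all function names are hypothetical. Each genome is encoded as a presence/absence vector over protein families, a model is fitted, and features are ranked by how strongly their presence separates the two classes.

```python
import math

# Toy data (entirely made up): genome -> (set of protein-family IDs, label).
# Label 1 = pathogenic, 0 = non-pathogenic.
GENOMES = {
    "g1": ({"PF_A", "PF_B", "PF_C"}, 1),
    "g2": ({"PF_A", "PF_B"}, 1),
    "g3": ({"PF_C", "PF_D"}, 0),
    "g4": ({"PF_D"}, 0),
}

def encode(fams, families):
    """Binary presence/absence vector of a genome over the known families."""
    return [1 if f in fams else 0 for f in families]

def build_features(genomes):
    """Return the sorted family vocabulary, the feature rows, and the labels."""
    families = sorted(set().union(*(fams for fams, _ in genomes.values())))
    rows = [encode(fams, families) for fams, _ in genomes.values()]
    labels = [label for _, label in genomes.values()]
    return families, rows, labels

def train_bernoulli_nb(rows, labels):
    """Fit per-class presence probabilities with add-one (Laplace) smoothing."""
    model = {}
    for cls in (0, 1):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        n = len(cls_rows)
        probs = [(sum(col) + 1) / (n + 2) for col in zip(*cls_rows)]
        model[cls] = (math.log(n / len(rows)), probs)
    return model

def predict(model, row):
    """Return the class with the highest log-posterior for one feature row."""
    scores = {}
    for cls, (log_prior, probs) in model.items():
        scores[cls] = log_prior + sum(
            math.log(p if x else 1 - p) for x, p in zip(row, probs)
        )
    return max(scores, key=scores.get)

def rank_features(model, families):
    """Rank families by |log ratio| of presence probability between classes."""
    p0, p1 = model[0][1], model[1][1]
    scored = [(f, math.log(a / b)) for f, a, b in zip(families, p1, p0)]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

families, rows, labels = build_features(GENOMES)
model = train_bernoulli_nb(rows, labels)
ranked = rank_features(model, families)
new_genome = encode({"PF_A", "PF_C"}, families)  # hypothetical unseen genome
print(predict(model, new_genome), ranked)
```

In this sketch a positive log ratio marks a family seen more often in pathogenic genomes and a negative one marks a family seen more often in non-pathogenic genomes. Inspecting such a ranking is one simple way a protein-family model can be interpreted, which is the advantage over read-based features that the abstract describes.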