Burstein David, Zusman Tal, Degtyar Elena, Viner Ram, Segal Gil, Pupko Tal
Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv, Israel.
PLoS Pathog. 2009 Jul;5(7):e1000508. doi: 10.1371/journal.ppat.1000508. Epub 2009 Jul 10.
A large number of highly pathogenic bacteria utilize secretion systems to translocate effector proteins into host cells. Using these effectors, the bacteria subvert host cell processes during infection. Legionella pneumophila translocates effectors via the Icm/Dot type-IV secretion system, and to date approximately 100 effectors have been identified by various experimental and computational techniques. Effector identification is a critical first step towards the understanding of the pathogenesis system in L. pneumophila as well as in other bacterial pathogens. Here, we formulate the task of effector identification as a classification problem: each L. pneumophila open reading frame (ORF) was classified as either effector or not. We computationally defined a set of features that best distinguish effectors from non-effectors. These features cover a wide range of characteristics including taxonomical dispersion, regulatory data, genomic organization, similarity to eukaryotic proteomes and more. Machine learning algorithms utilizing these features were then applied to classify all the ORFs within the L. pneumophila genome. Using this approach we were able to predict and experimentally validate 40 new effectors, reaching a success rate of above 90%. Increasing the number of validated effectors to around 140, we were able to gain novel insights into their characteristics. Effectors were found to have low G+C content, supporting the hypothesis that a large number of effectors originate via horizontal gene transfer, probably from their protozoan host. In addition, effectors were found to cluster in specific genomic regions. Finally, we were able to provide a novel description of the C-terminal translocation signal required for effector translocation by the Icm/Dot secretion system. To conclude, we have discovered 40 novel L. pneumophila effectors, predicted over a hundred additional highly probable effectors, and shown the applicability of machine learning algorithms for the identification and characterization of bacterial pathogenesis determinants.
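The classification framing in the abstract can be illustrated with a minimal sketch. The abstract does not name the machine learning algorithms used, so a simple nearest-centroid classifier is swapped in here as a stand-in. Only two of the feature types mentioned are represented: G+C content, computed from the sequence, and a similarity-to-eukaryotes score, which is a hypothetical precomputed number. The sequences, scores, and labels are invented toy values for illustration, not data from the study.

```python
# Illustrative sketch only: toy sequences and labels, not data from the study.
# Nearest-centroid is an assumed stand-in for the unspecified ML algorithms.


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def centroid(rows):
    """Component-wise mean of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(features, labels):
    """Return one centroid per class label."""
    by_label = {}
    for row, label in zip(features, labels):
        by_label.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, row):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(model, key=lambda label: dist(model[label]))


# Hypothetical training ORFs: (sequence, eukaryote-similarity score, label).
toy_orfs = [
    ("ATTATAGCATTTAAT", 0.9, "effector"),
    ("AATTTGCATATTAAA", 0.7, "effector"),
    ("GCGCGGATCCGGCGA", 0.1, "non-effector"),
    ("GGCCGCTAGCGCGTT", 0.2, "non-effector"),
]
X = [[gc_content(seq), score] for seq, score, _ in toy_orfs]
y = [label for _, _, label in toy_orfs]
model = train(X, y)

# Classify a new, equally hypothetical ORF with low G+C content.
query = [gc_content("ATATTTAAGCATTAT"), 0.8]
print(predict(model, query))
```

In a realistic setting, each ORF would carry many more features (for example regulatory data and genomic neighborhood), the features would be scaled to comparable ranges, and the classifier would be evaluated by cross-validation before any prediction is taken to the bench.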