Laboratory of Systems and Synthetic Biology, Wageningen University, Wageningen, the Netherlands.
UNLOCK, Wageningen University, Wageningen, the Netherlands.
BMC Genomics. 2021 Nov 23;22(1):848. doi: 10.1186/s12864-021-08093-0.
The genus Xanthomonas has long been considered to consist predominantly of plant pathogens, but over the last decade there has been an increasing number of reports of non-pathogenic and endophytic members. As Xanthomonas species are prevalent pathogens on a wide variety of important crops around the world, there is a need to distinguish between these plant-associated phenotypes. To date, a large number of Xanthomonas genomes have been sequenced, which enables the application of machine learning (ML) approaches to the genome content to predict this phenotype. Until now, such approaches to the pathogenomics of Xanthomonas strains have been hampered by the fragmentation of information on the pathogenicity of individual strains across many studies. Unification of this information into a single resource was therefore considered an essential step.
Mining of 39 papers that considered both plant-associated phenotypes allowed for a phenotypic classification of 578 Xanthomonas strains. For 65 plant-pathogenic and 53 non-pathogenic strains, the corresponding genomes were available and were de novo annotated for the presence of Pfam protein domains, which were used as features to train and compare three ML classification algorithms: CART, Lasso and Random Forest.
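As an illustration of how domain presence can serve as classifier input, the following minimal sketch turns per-genome Pfam annotations into a binary presence/absence matrix. The strain names, the placeholder accessions, the labels and the helper `build_presence_matrix` are all hypothetical and chosen only for the example; they do not reproduce the study's data or pipeline.

```python
# Hypothetical annotation output: strain ID -> set of Pfam accessions detected.
# The accessions are arbitrary placeholders, not domains reported by the study.
annotations = {
    "strain_A": {"PF00001", "PF00002"},
    "strain_B": {"PF00002", "PF00003"},
    "strain_C": {"PF00003"},
}

# Hypothetical phenotype labels, as they might be collected from the literature.
labels = {
    "strain_A": "pathogenic",
    "strain_B": "pathogenic",
    "strain_C": "non-pathogenic",
}


def build_presence_matrix(annotations):
    """Return (strains, domains, matrix) with matrix[i][j] = 1 if strain i has domain j."""
    strains = sorted(annotations)
    domains = sorted(set().union(*annotations.values()))
    matrix = [
        [1 if domain in annotations[strain] else 0 for domain in domains]
        for strain in strains
    ]
    return strains, domains, matrix


strains, domains, matrix = build_presence_matrix(annotations)
targets = [labels[strain] for strain in strains]

for strain, row, target in zip(strains, matrix, targets):
    print(strain, row, target)
```

Each row is one strain and each column one domain, so the matrix and the `targets` list can be passed to any classifier that accepts tabular input.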
The literature resource, in combination with recursive feature extraction used in the ML classification algorithms, provided further insights into the virulence-enabling factors, but also highlighted domains linked to traits not present in pathogenic strains.
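The idea of recursively pruning features can be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: it uses scikit-learn's `RFE` (which that library calls recursive feature elimination) around a Random Forest, and the data are synthetic. Only the class sizes (65 and 53) come from the text; the 40 domains, the three informative columns and all parameter values are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)

# Synthetic stand-in for a strains x domains presence/absence matrix.
n_pathogenic, n_nonpathogenic, n_domains = 65, 53, 40
y = np.array([1] * n_pathogenic + [0] * n_nonpathogenic)
X = rng.integers(0, 2, size=(y.size, n_domains))

# Assumption for the demo: the first three made-up domains track the label exactly.
X[:, :3] = y[:, None]

# Repeatedly fit the forest, rank features by importance, and drop the
# five lowest-ranked features per round until five remain.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
selector = RFE(forest, n_features_to_select=5, step=5).fit(X, y)

selected = np.flatnonzero(selector.support_)
print("retained feature columns:", selected)
```

The retained columns would then be mapped back to domain names and inspected for biological plausibility; on real data the selection should be validated with held-out strains.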