Vorimore Fabien, Jaudou Sandra, Tran Mai-Lan, Richard Hugues, Fach Patrick, Delannoy Sabine
ANSES, Laboratory for Food Safety, Genomics Platform IdentyPath, Maisons-Alfort, France.
ANSES, Laboratory for Food Safety, COLiPATH Unit, Maisons-Alfort, France.
Front Microbiol. 2023 May 12;14:1118158. doi: 10.3389/fmicb.2023.1118158. eCollection 2023.
The objective of this study was to develop, using a genome-wide machine learning approach, an unambiguous model to predict the presence of highly pathogenic Shiga toxin-producing E. coli (STEC) in read assemblies derived from complex samples that may contain multiple strains. Our approach took into account the high genomic plasticity of E. coli and used the stratification of STEC and pathogroup classification, based on serotype and virulence factors, to identify specific combinations of biomarkers for improved characterization of eae-positive STEC (also named EHEC, for enterohemorrhagic E. coli), which are associated with bloody diarrhea and hemolytic uremic syndrome (HUS) in humans.
A Machine Learning (ML) approach was applied to a large curated dataset composed of 1,493 E. coli genome sequences and 1,178 coding sequences (CDS). Feature selection was performed using eight classification algorithms, reducing the number of CDS to six. From this reduced dataset, the eight ML models were trained with hyperparameter tuning and cross-validation.
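The general shape of such a workflow (select a handful of discriminative CDS, then check the reduced model by cross-validation) can be illustrated with a minimal sketch. This is not the authors' pipeline: the study used eight classification algorithms with hyperparameter tuning, whereas the sketch below substitutes a simple frequency-difference filter and a nearest-centroid classifier. The data are synthetic presence/absence values, and every function name (`make_toy_data`, `rank_features`, `cross_validate`, and so on) is hypothetical.

```python
import random

def make_toy_data(n_samples=200, n_cds=50, n_informative=6, noise=0.05, seed=1):
    """Synthetic CDS presence/absence matrix (assumption: not real genome data).
    The first n_informative columns track the label; the rest are random."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)  # 1 = EHEC, 0 = other E. coli
        row = []
        for j in range(n_cds):
            if j < n_informative:
                row.append(label if rng.random() > noise else 1 - label)
            else:
                row.append(rng.randint(0, 1))
        X.append(row)
        y.append(label)
    return X, y

def rank_features(X, y):
    """Rank columns by |P(present | EHEC) - P(present | non-EHEC)|, best first."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        p_pos = sum(r[j] for r in pos) / len(pos)
        p_neg = sum(r[j] for r in neg) / len(neg)
        scores.append(abs(p_pos - p_neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

def stratified_folds(y, k, rng):
    """Split sample indices into k folds with roughly equal class balance."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for n, i in enumerate(idx):
            folds[n % k].append(i)
    return folds

def fit_centroids(X, y, cols):
    """Stand-in model: mean presence of each selected CDS, per class."""
    cents = {}
    for label in set(y):
        rows = [X[i] for i in range(len(y)) if y[i] == label]
        cents[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    return cents

def predict(cents, row, cols):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(cols, c))
    return min(cents, key=lambda label: dist(cents[label]))

def cross_validate(X, y, n_keep, k=5, seed=0):
    """Mean accuracy over k stratified folds, keeping n_keep features per fold."""
    folds = stratified_folds(y, k, random.Random(seed))
    accs = []
    for test in folds:
        held = set(test)
        train = [i for i in range(len(y)) if i not in held]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        # Select features inside the fold so the test samples never leak in.
        cols = rank_features(Xtr, ytr)[:n_keep]
        cents = fit_centroids(Xtr, ytr, cols)
        hits = sum(predict(cents, X[i], cols) == y[i] for i in test)
        accs.append(hits / len(test))
    return sum(accs) / k

if __name__ == "__main__":
    X, y = make_toy_data()
    print("top-ranked CDS columns:", rank_features(X, y)[:6])
    print("mean CV accuracy with 6 CDS:", cross_validate(X, y, n_keep=6))
```

Ranking the features inside each training fold, rather than once on the full dataset, keeps the held-out genomes out of the selection step, so the cross-validated accuracy is not inflated by the choice of markers.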
Remarkably, using only these six genes, EHEC can be clearly identified in read assemblies obtained from in silico mixtures and from complex samples such as milk metagenomes. The various combinations of these discriminative biomarkers can be implemented as novel marker genes for unambiguous EHEC characterization from mixtures of different E. coli strains as well as from raw milk metagenomes.