Department of Software and Information Systems Engineering, Ben Gurion University of the Negev, Beer Sheva 8410501, Israel.
Department of Health Systems Management, Ben Gurion University of the Negev, Beer Sheva 8410501, Israel.
Bioinformatics. 2021 Apr 20;37(3):303-311. doi: 10.1093/bioinformatics/btaa724.
High-resolution microbial strain typing is essential for various clinical purposes, including disease outbreak investigation, tracking of microbial transmission events and epidemiological surveillance of bacterial infections. The widely used approach to multilocus sequence typing (MLST) that is based on the core genome, cgMLST, has the advantage of a high level of typeability and maximal discriminatory power. Yet the transition from a seven-locus scheme to cgMLST involves several challenges: some users need to maintain backward compatibility; day-to-day communication within the microbiology community about nomenclature and ontology becomes more difficult; typeability suffers, especially when a more stringent approach to loci presence is used; and laboratory data management and sharing with end-users impose computational requirements. Hence, methods for optimizing cgMLST schemes through careful reduction of the number of loci are expected to be beneficial for practical needs in different settings.
We present a new machine learning-based methodology, minMLST, for minimizing the number of genes in cgMLST schemes by identifying subsets of informative genes and analyzing the trade-off between gene reduction and typing performance. The results achieved with minMLST over eight bacterial species show that, even when the number of genes is reduced by up to a factor of 10, the typing performance remains very high and statistically significant, with an Adjusted Rand Index that ranges between 0.4 and 0.93 across species and a P-value < 10^-3. The identification of such optimized MLST schemes for bacterial strain typing is expected to improve the implementation of cgMLST by improving interlaboratory agreement and communication.
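To make the reported metric concrete, the following is a minimal sketch of how an Adjusted Rand Index can be computed between the strain types assigned by a full scheme and those assigned by a reduced scheme. It is not the minMLST API: the function name `adjusted_rand_index` and the toy label lists are assumptions made purely for illustration, and the labels are not data from the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same isolates.

    Hypothetical helper for illustration; not part of the minMLST package.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same isolates")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivially identical

    # Pairs of isolates grouped together in both partitions.
    together_in_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs grouped together within each partition separately.
    together_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = together_in_a * together_in_b / comb(n, 2)
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (together_in_both - expected) / (maximum - expected)


# Toy labels (illustrative only): types from a full scheme versus a
# reduced scheme that assigns one isolate to a different type.
full_scheme_types = ["A", "A", "A", "B", "B", "C"]
reduced_scheme_types = ["A", "A", "B", "B", "B", "C"]
ari = adjusted_rand_index(full_scheme_types, reduced_scheme_types)
```

A value of 1 means the reduced scheme reproduces the full-scheme partition exactly (up to renaming of types), while values near 0 indicate agreement no better than chance.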
The Python package minMLST is available at https://PyPi.org/project/minmlst/ and supported on Linux and Windows.
Supplementary data are available at Bioinformatics online.