Reshetnikov Kirill, Bykova Daria, Kuleshov Konstantin, Chukreev Konstantin, Guguchkin Egor, Neverov Alexey, Fedonin Gennady
Central Research Institute of Epidemiology, Moscow, Russia.
Faculty of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia.
Front Microbiol. 2025 Jun 18;16:1586476. doi: 10.3389/fmicb.2025.1586476. eCollection 2025.
Drug resistance (DR) of pathogens remains a global healthcare concern. In contrast to other bacteria, acquiring mutations in the core genome is the main mechanism of drug resistance for Mycobacterium tuberculosis (MTB). For some antibiotics, the resistance of a particular isolate can be reliably predicted by identifying specific mutations, while for other antibiotics the knowledge of resistance mechanisms is limited. Statistical machine learning (ML) methods are used to infer new genes implicated in drug resistance by leveraging large collections of isolates with known whole-genome sequences and phenotypic states for different drugs. However, high correlations between the phenotypic states for commonly used drugs make it difficult for ML approaches to infer true associations between mutations and drug phenotypes.
Recently, several new methods have been developed to select a small subset of reliable predictors of the dependent variable, which may help reduce the number of spurious associations identified. In this study, we evaluated several such methods: logistic regression with different regularization penalty functions; ABESS, a recently introduced algorithm for solving the best-subset selection problem; and "Hungry, Hungry SNPos" (HHS), a heuristic algorithm specifically developed to identify resistance-associated genetic variants in the presence of resistance co-occurrence. We assessed their ability to select known causal mutations for resistance to a specific drug while avoiding the selection of mutations in genes associated with resistance to other drugs. In this way, we compared the selected ML models for their applicability to MTB genome-wide association studies.
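The evaluation criterion described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `score_selection`, the three categories, and the toy data are assumptions made for illustration, and `geneX_A10T` is a made-up placeholder variant. The sketch counts how many mutations selected for one drug fall in genes known for that drug, in genes known for other drugs, or elsewhere.

```python
from typing import Dict, Iterable, Set


def score_selection(
    selected: Iterable[str],
    gene_of: Dict[str, str],
    target_genes: Set[str],
    other_drug_genes: Set[str],
) -> Dict[str, int]:
    """Count selected mutations by the category of the gene they fall in."""
    counts = {"on_target": 0, "off_target": 0, "unknown": 0}
    for mutation in selected:
        gene = gene_of.get(mutation)
        if gene in target_genes:
            counts["on_target"] += 1      # gene known for the drug of interest
        elif gene in other_drug_genes:
            counts["off_target"] += 1     # gene known for a different drug
        else:
            counts["unknown"] += 1        # no prior knowledge about this gene
    return counts


# Toy input (assumed): three mutations selected by some model for rifampicin.
gene_of = {
    "rpoB_S450L": "rpoB",
    "katG_S315T": "katG",
    "embB_M306V": "embB",
    "geneX_A10T": "geneX",  # hypothetical placeholder variant
}
selected_for_rifampicin = ["rpoB_S450L", "katG_S315T", "geneX_A10T"]

result = score_selection(
    selected_for_rifampicin,
    gene_of,
    target_genes={"rpoB"},
    other_drug_genes={"katG", "embB"},
)
print(result)
```

Under this scheme, a method that copes well with co-occurring resistance would show a high on-target count and a low off-target count for each drug.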
In our analysis, ABESS significantly outperformed the other methods, selecting more relevant sets of mutations. Additionally, we demonstrated that aggregating rare mutations within protein-coding genes into markers indicative of changes in PFAM domains improved prediction quality. These markers were predominantly selected by ABESS, suggesting that they are highly informative. However, ABESS yielded lower prediction accuracy than the regularized logistic regression methods.
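The idea of aggregating rare mutations into domain-level markers can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the study: the rarity threshold `max_count`, the `domain:` feature prefix, the mutation-to-domain map, and the label `pncA/domain_1` are all hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def aggregate_rare(
    isolates: Dict[str, Set[str]],
    domain_of: Dict[str, str],
    max_count: int = 1,
) -> Dict[str, Set[str]]:
    """Replace rare mutations with a marker for the domain they fall in."""
    # Number of isolates carrying each mutation.
    freq = Counter(m for muts in isolates.values() for m in muts)
    features: Dict[str, Set[str]] = {}
    for isolate, muts in isolates.items():
        row: Set[str] = set()
        for m in muts:
            if freq[m] <= max_count and m in domain_of:
                # Rare mutation inside an annotated domain: collapse it.
                row.add("domain:" + domain_of[m])
            else:
                # Common mutation, or no domain annotation: keep as-is.
                row.add(m)
        features[isolate] = row
    return features


# Toy input (assumed): each pncA variant occurs in only one isolate.
isolates = {
    "iso1": {"pncA_H51D", "rpoB_S450L"},
    "iso2": {"pncA_D49N", "rpoB_S450L"},
    "iso3": {"pncA_Q10P"},
}
domain_of = {
    "pncA_H51D": "pncA/domain_1",  # hypothetical domain label
    "pncA_D49N": "pncA/domain_1",
    "pncA_Q10P": "pncA/domain_1",
}

features = aggregate_rare(isolates, domain_of)
for isolate in sorted(features):
    print(isolate, sorted(features[isolate]))
```

In the toy data, three singleton variants that would each be too rare to support an association become one marker shared by all three isolates, which is the kind of pooling of signal the abstract credits with improving prediction quality.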