Yang Ming-Ren, Wu Yu-Wei
Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, ROC.
Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan, ROC.
Comput Struct Biotechnol J. 2022 Dec 28;21:769-779. doi: 10.1016/j.csbj.2022.12.046. eCollection 2023.
Understanding genes and their underlying mechanisms is critical to deciphering how antimicrobial-resistant (AMR) bacteria withstand the detrimental effects of antibiotic drugs. At the same time, genes related to AMR phenotypes may also serve as biomarkers for predicting whether a microbial strain is resistant to a given antibiotic drug. We developed a Cross-Validated Feature Selection (CVFS) approach for robustly selecting the most parsimonious gene sets for predicting AMR activities from bacterial pan-genomes. The core idea behind CVFS is to interrogate features across non-overlapping sub-parts of the dataset so that the selected features are representative of the whole. By randomly splitting the dataset into disjoint sub-parts, conducting feature selection within each sub-part, and keeping only the features shared by all sub-parts, CVFS extracts the most representative features while still yielding satisfactory AMR activity prediction accuracy. Testing this idea on bacterial pan-genome datasets, we showed that the approach extracted the most succinct feature sets that predicted AMR activities very well, indicating the potential of these genes as AMR biomarkers. Functional analysis demonstrated that CVFS extracted both known AMR genes and novel ones, suggesting that the algorithm selects relevant features and highlighting the potential of the novel genes for expanding antimicrobial resistance gene databases.
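The split, select, and intersect steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`select_features`, `split_disjoint`, `cvfs`), the presence-rate selector, the 0.5 threshold, and the toy data are all assumptions made for illustration; the abstract does not say which feature selector CVFS uses. The split here is also stratified by label so that every sub-part contains both resistant and susceptible strains, which is a convenience of this sketch rather than something the abstract specifies.

```python
import random


def select_features(X, y, threshold=0.5):
    """Stand-in selector (an assumption, not the paper's method): keep genes
    whose presence rate differs between resistant (1) and susceptible (0)
    strains by at least `threshold`."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    if not pos or not neg:
        return set()
    selected = set()
    for j in range(len(X[0])):
        rate_pos = sum(row[j] for row in pos) / len(pos)
        rate_neg = sum(row[j] for row in neg) / len(neg)
        if abs(rate_pos - rate_neg) >= threshold:
            selected.add(j)
    return selected


def split_disjoint(y, n_parts, rng):
    """Randomly assign strain indices to disjoint sub-parts, shuffling each
    label group separately (stratification is an assumption of this sketch)."""
    parts = [[] for _ in range(n_parts)]
    for label in sorted(set(y)):
        idx = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            parts[k % n_parts].append(i)
    return parts


def cvfs(X, y, n_parts=3, threshold=0.5, seed=0):
    """Select features within each disjoint sub-part, then keep only the
    features shared by all sub-parts."""
    rng = random.Random(seed)
    shared = None
    for part in split_disjoint(y, n_parts, rng):
        feats = select_features([X[i] for i in part],
                                [y[i] for i in part],
                                threshold)
        shared = feats if shared is None else shared & feats
    return shared


# Toy gene presence/absence matrix: 12 strains x 3 genes (hypothetical data).
# Gene 0 tracks resistance, gene 1 is present everywhere, and gene 2 occurs
# in only one resistant and one susceptible strain.
y = [1] * 6 + [0] * 6
X = [[label, 1, 1 if i in (0, 6) else 0] for i, label in enumerate(y)]

print(sorted(cvfs(X, y, n_parts=3)))  # -> [0]
```

Gene 2 can pass the threshold inside a single sub-part when only one of its two carriers lands there, but it cannot pass in all three, so the intersection discards it. Only gene 0, which is associated with resistance in every sub-part, survives.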