Computer Science Department, Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana, USA.
J Comput Biol. 2022 Jul;29(7):738-751. doi: 10.1089/cmb.2021.0640. Epub 2022 May 17.
Microbial organisms play important roles in many aspects of human health and disease. Encouraged by the numerous studies that show an association between microbiomes and human diseases, computational and machine learning methods have recently been developed to generate and utilize microbiome features for the prediction of host phenotypes, such as disease versus healthy and cancer immunotherapy responder versus nonresponder. We have previously developed a subtractive assembly approach, which focuses on the extraction and assembly of differential reads from metagenomic data sets, that is, reads that are likely sampled from genomes or genes that differ between two groups of microbiome data sets (e.g., healthy vs. disease). In this article, we further improved our subtractive assembly approach by utilizing groups of k-mers with similar abundance profiles across multiple samples. We implemented a locality-sensitive hashing (LSH)-enabled approach (called kmerLSHSA) to group billions of k-mers into k-mer coabundance groups (kCAGs), which were subsequently used for the retrieval of differential kCAGs for subtractive assembly. Testing of the kmerLSHSA approach on simulated data sets and real microbiome data sets showed that, compared with the conventional approach that utilizes all genes, our approach can quickly identify differential genes that can be used for building promising predictive models for microbiome-based host phenotype prediction. We also discussed other potential applications of LSH-enabled clustering of k-mers according to their abundance profiles across multiple microbiome samples.
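The grouping step can be illustrated with a minimal sketch. It is not the kmerLSHSA implementation: the abstract does not state which LSH family is used, so random-hyperplane (sign) hashing is assumed here, and the function names, the `n_bits` parameter, and the toy abundance table are all hypothetical. Each k-mer's abundance profile across samples is centered and scaled, then hashed to a bit signature; k-mers that share a signature fall into the same candidate group.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(n_samples, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in sample space (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_bits)]


def normalize(profile):
    """Center and scale a profile so hashing reflects its shape, not its magnitude."""
    mean = sum(profile) / len(profile)
    centered = [x - mean for x in profile]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered] if norm > 0 else centered


def signature(profile, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane the profile lies on."""
    v = normalize(profile)
    return tuple(
        int(sum(h_i * v_i for h_i, v_i in zip(h, v)) >= 0) for h in hyperplanes
    )


def group_kmers(abundance, n_bits=8, seed=0):
    """Bucket k-mers by LSH signature of their abundance profiles.

    abundance maps each k-mer to a list of counts, one per sample.
    Returns a list of groups (lists of k-mers); illustrative only.
    """
    n_samples = len(next(iter(abundance.values())))
    planes = make_hyperplanes(n_samples, n_bits, seed)
    buckets = defaultdict(list)
    for kmer, profile in abundance.items():
        buckets[signature(profile, planes)].append(kmer)
    return list(buckets.values())


if __name__ == "__main__":
    # Toy table: rows are k-mers, columns are four samples (made-up counts).
    toy = {
        "ACGT": [1, 2, 3, 4],
        "CGTA": [1, 2, 3, 4],
        "TTGA": [4, 3, 2, 1],
    }
    for group in group_kmers(toy):
        print(group)
```

Hashing avoids comparing every pair of k-mers, which is what makes grouping at the scale of billions of k-mers plausible. A single signature can still split similar profiles that straddle a hyperplane, so a practical tool would likely combine several hash tables or a refinement pass; that part is left out of this sketch.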