Quantitative and Computational Biology Program, Department of Biological Sciences, Los Angeles, CA 90089, USA.
Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA.
Bioinformatics. 2021 May 5;37(6):759-766. doi: 10.1093/bioinformatics/btaa912.
The rapid development of sequencing technologies has made it possible to generate large numbers of metagenomic reads from the genetic material in microbial communities, and thus to gain deep insight into how the genetic material of different groups of microorganisms, such as bacteria, viruses and plasmids, differs. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all k-mers as features for prediction, without selecting the k-mers that are relevant to the different groups of sequences, i.e. the unique nucleotide patterns that carry biological significance.
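As a rough illustration of the k-mer frequency features mentioned above, the following Python sketch turns one sequence into a normalized k-mer frequency vector. The function name `kmer_frequencies`, the restriction to the ACGT alphabet, and the choice to skip windows containing other characters are assumptions made for this example; they are not taken from the KIMI implementation.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return a dict mapping every ACGT k-mer to its relative frequency in seq.

    Illustrative helper (hypothetical name). Windows containing characters
    other than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    # All 4**k possible k-mers, in lexicographic order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Normalize only over valid ACGT windows.
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGTAC", 2)
    print(freqs["AC"])  # "AC" fills 2 of the 5 windows
```

With k = 2 every sequence is summarized by 16 features, and the feature count grows as 4^k, which is one reason selecting only the relevant k-mers can matter.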
To select k-mers that distinguish different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework for sequence motif discovery at an arbitrary target FDR level, so that reproducibility can be theoretically guaranteed. KIMI is based on model-X Knockoffs, which are regarded as the state-of-the-art statistical method for FDR control. Simulation studies show that KIMI is effective in simultaneously controlling the FDR and yielding high power, outperforming the broadly used Benjamini-Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and apply KIMI to a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on the relevant k-mers selected by KIMI.
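To make the two kinds of FDR control concrete, the sketch below implements the generic selection step of the knockoff filter (the "knockoff+" threshold) next to the Benjamini-Hochberg procedure. It is a minimal illustration of the standard procedures, not KIMI's code. It assumes that a knockoff statistic W_j has already been computed for each k-mer j, with large positive values suggesting a real signal, and that per-k-mer p-values are available for Benjamini-Hochberg. Constructing the knockoff variables and the statistics is not shown, and the function names are hypothetical.

```python
def knockoff_plus_select(W, q):
    """Select indices using the knockoff+ threshold at target FDR level q.

    W: list of knockoff statistics, one per feature (assumed precomputed).
    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        est_false = 1 + sum(1 for w in W if w <= -t)  # estimated false discoveries
        n_selected = sum(1 for w in W if w >= t)
        if est_false / max(n_selected, 1) <= q:
            return [j for j, w in enumerate(W) if w >= t]
    return []  # no threshold meets the target level


def benjamini_hochberg(pvals, q):
    """Return indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    k_max = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= rank * q / m:
            k_max = rank  # keep the largest rank that passes
    return sorted(order[:k_max])


if __name__ == "__main__":
    W = [5, 4, 3, 2.5, 2, 1.5, -1, 0.5, 3.5, 4.5]
    print(knockoff_plus_select(W, q=0.2))
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(benjamini_hochberg(pvals, q=0.05))
```

The knockoff step needs only the signs and magnitudes of the statistics, whereas Benjamini-Hochberg needs a valid p-value for every k-mer.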
Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI.
Supplementary data are available at Bioinformatics online.