Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure.

Affiliation

Bioinformatics Group, Wageningen University, Wageningen, The Netherlands.

Publication information

Microb Genom. 2022 May;8(5). doi: 10.1099/mgen.0.000823.

DOI:10.1099/mgen.0.000823

PMID:35503723

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9465069/

Abstract

Metagenomics has become a prominent technology to study the functional potential of all organisms in a microbial community. Most studies focus on the bacterial content of these communities, while ignoring eukaryotic microbes. Indeed, many metagenomics analysis pipelines silently assume that all contigs in a metagenome are prokaryotic, likely resulting in less accurate annotation of eukaryotes in metagenomes. Early detection of eukaryotic contigs allows for eukaryote-specific gene prediction and functional annotation. Here, we developed a classifier that distinguishes eukaryotic from prokaryotic contigs based on foundational differences between these taxa in terms of gene structure. We first developed Whokaryote, a random forest classifier that uses intergenic distance, gene density and gene length as the most important features. We show that, with an estimated recall, precision and accuracy of 94, 96 and 95 %, respectively, this classifier with features grounded in biology can perform almost as well as the classifiers EukRep and Tiara, which use k-mer frequencies as features. By retraining our classifier with Tiara predictions as an additional feature, the weaknesses of both types of classifiers are compensated; the result is Whokaryote+Tiara, an enhanced classifier that outperforms all individual classifiers, with an F1 score of 0.99 for both eukaryotes and prokaryotes, while still being fast. In a reanalysis of metagenome data from a disease-suppressive plant endospheric microbial community, we show how using Whokaryote+Tiara to select contigs for eukaryotic gene prediction facilitates the discovery of several biosynthetic gene clusters that were missed in the original study. Whokaryote (+Tiara) is wrapped in an easily installable package and is freely available from https://github.com/LottePronk/whokaryote.
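To make the gene-structure features named in the abstract more concrete, the following is a minimal Python sketch of how intergenic distance, gene density and gene length could be summarised for a single contig from predicted gene coordinates. It is not taken from the Whokaryote code base: the function name `gene_structure_features`, the input format (a list of 1-based inclusive `(start, end)` pairs), the genes-per-kilobase definition of density, and the choice to clamp overlapping genes to a distance of zero are all assumptions made for illustration.

```python
from statistics import mean


def gene_structure_features(genes, contig_length):
    """Summarise gene structure on one contig (hypothetical helper).

    genes: list of (start, end) tuples, 1-based inclusive coordinates
           of predicted genes (assumed input format).
    contig_length: contig length in base pairs.
    """
    if not genes:
        return {
            "gene_density": 0.0,
            "mean_gene_length": 0.0,
            "mean_intergenic_distance": 0.0,
        }

    genes = sorted(genes)
    lengths = [end - start + 1 for start, end in genes]

    # Distance between the end of one gene and the start of the next.
    # Overlapping genes are clamped to 0 here (a simplifying assumption).
    gaps = [
        max(0, next_start - prev_end - 1)
        for (_, prev_end), (next_start, _) in zip(genes, genes[1:])
    ]

    return {
        # Genes per kilobase of contig (assumed definition of density).
        "gene_density": 1000 * len(genes) / contig_length,
        "mean_gene_length": mean(lengths),
        "mean_intergenic_distance": mean(gaps) if gaps else 0.0,
    }


if __name__ == "__main__":
    # Toy contig with three predicted genes.
    example = gene_structure_features([(1, 100), (201, 300), (501, 700)], 1000)
    print(example)
```

Per-contig summaries of this kind could then serve as the feature vector for a random forest classifier; the intuition behind such features is that prokaryotic contigs tend to be densely packed with genes separated by short intergenic regions, whereas eukaryotic contigs tend to show lower gene density and longer gaps.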
