Artificial functional difference between microbial communities caused by length difference of sequencing reads.

Authors

Zhang Quan, Doak Thomas G, Ye Yuzhen

Affiliation

School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA.

Publication

Pac Symp Biocomput. 2012:259-70.

Abstract

Homology-based approaches are often used to annotate microbial communities, providing functional profiles that are used to characterize and compare the content and functionality of those communities. Metagenomic reads are the starting data for these studies; however, considerable differences are observed between functional profiles built from reads produced by different sequencing techniques, even for the same microbial community. Using simulation experiments, we show that such functional differences are likely caused by the actual difference in read lengths and are not the result of a sampling bias of the sequencing techniques. Furthermore, the functional differences derived from different sequencing techniques cannot be fully explained by read-count bias, i.e., 1) the higher fraction of unannotated shorter reads ("read length matters"), and 2) the different lengths of proteins in different functional categories. Instead, we show here that specific functional categories are under-annotated because similarity-search-based functional annotation tools tend to miss more reads from functional categories that contain less conserved genes/proteins. In addition, the accuracy of functional annotation of short reads varies across functions, further skewing the functional profiles. To address these issues, we present a simple yet efficient method to improve the frequency estimates of different functional categories in the functional profiles of metagenomes, based on the functional annotation of simulated reads from complete microbial genomes.
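The abstract does not spell out the correction procedure, so the following is only a minimal sketch of one way such a correction could work, not the authors' method. It assumes that reads simulated from complete genomes carry a known true functional category, that a per-category recovery rate can be measured from them, and that observed metagenome counts are divided by that rate. The function names, the `min_rate` floor, and the toy categories "A" and "B" are all hypothetical, and the sketch ignores misannotated reads.

```python
from collections import Counter

def recovery_rates(true_labels, predicted_labels):
    """Fraction of simulated reads per category that the tool annotated correctly.

    true_labels: known category of each simulated read.
    predicted_labels: category assigned by the annotation tool (None if unannotated).
    """
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if p == t)
    return {cat: hits[cat] / n for cat, n in totals.items()}

def corrected_profile(observed_counts, rates, min_rate=0.05):
    """Rescale observed counts by recovery rate, then normalize to frequencies.

    min_rate is a hypothetical floor that keeps rarely recovered categories
    from being inflated without bound; categories with no measured rate are
    left unscaled (rate 1.0).
    """
    adjusted = {}
    for cat, count in observed_counts.items():
        rate = max(rates.get(cat, 1.0), min_rate)
        adjusted[cat] = count / rate
    total = sum(adjusted.values())
    return {cat: value / total for cat, value in adjusted.items()}

# Toy data: category "A" is well conserved, category "B" is less conserved,
# so more of its simulated short reads go unannotated.
sim_true = ["A"] * 10 + ["B"] * 10
sim_pred = ["A"] * 8 + [None] * 2 + ["B"] * 4 + [None] * 6

rates = recovery_rates(sim_true, sim_pred)
profile = corrected_profile({"A": 80, "B": 40}, rates)
print(rates)
print(profile)
```

In this toy case the raw counts suggest that "A" is twice as abundant as "B", but after rescaling by the measured recovery rates the two categories come out with roughly equal shares, which illustrates how under-annotation of less conserved categories can skew a profile.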
