Deconvoluting simulated metagenomes: the performance of hard- and soft- clustering algorithms applied to metagenomic chromosome conformation capture (3C).

Authors

DeMaere Matthew Z, Darling Aaron E

Affiliation

ithree institute, University of Technology Sydney, Sydney, NSW, Australia.

Publication information

PeerJ. 2016 Nov 8;4:e2676. doi: 10.7717/peerj.2676. eCollection 2016.

Abstract

BACKGROUND

Chromosome conformation capture, coupled with high-throughput DNA sequencing in protocols like Hi-C and 3C-seq, has been proposed as a viable means of generating data to resolve the genomes of microorganisms living in naturally occurring environments. Metagenomic Hi-C and 3C-seq datasets have begun to emerge, but the feasibility of resolving genomes when closely related organisms (strain-level diversity) are present in the sample has not yet been systematically characterised.

METHODS

We developed a computational simulation pipeline for metagenomic 3C and Hi-C sequencing to evaluate the accuracy of genomic reconstructions at, above, and below an operationally defined species boundary. We simulated datasets and measured accuracy over a wide range of parameters. Five clustering algorithms were evaluated (2 hard, 3 soft) using an adaptation of the extended B-cubed validation measure.
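
To make the validation measure concrete, the following is a minimal sketch of the standard extended B-cubed precision and recall for overlapping clusterings. It is not the authors' adaptation, which the abstract does not describe. The function name `extended_bcubed`, the dictionary-of-sets input layout, and the choice to score items with no cluster assignment as zero precision are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Set, Tuple

# Maps each item (e.g. a contig) to the set of labels it belongs to.
Assignment = Dict[Hashable, Set[Hashable]]


def extended_bcubed(clusters: Assignment, truth: Assignment) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of a possibly overlapping clustering.

    `truth` maps each item to its true source genomes; `clusters` maps each
    item to the clusters it was assigned to. Both layouts are assumptions.
    """
    items = list(truth)
    if not items:
        return 0.0, 0.0, 0.0

    precision_sum = 0.0
    recall_sum = 0.0
    for e in items:
        prec_terms = []
        rec_terms = []
        for o in items:
            shared_clusters = len(clusters.get(e, set()) & clusters.get(o, set()))
            shared_classes = len(truth[e] & truth[o])
            overlap = min(shared_clusters, shared_classes)
            if shared_clusters:  # pair co-occurs in at least one cluster
                prec_terms.append(overlap / shared_clusters)
            if shared_classes:  # pair co-occurs in at least one true genome
                rec_terms.append(overlap / shared_classes)
        # Assumption: an item with no cluster (or no class) contributes zero.
        precision_sum += sum(prec_terms) / len(prec_terms) if prec_terms else 0.0
        recall_sum += sum(rec_terms) / len(rec_terms) if rec_terms else 0.0

    precision = precision_sum / len(items)
    recall = recall_sum / len(items)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: contig "d" is shared by two closely related genomes.
truth = {"a": {"g1"}, "b": {"g1"}, "c": {"g2"}, "d": {"g1", "g2"}}
one_big_cluster = {item: {0} for item in truth}
print(extended_bcubed(one_big_cluster, truth))
```

In the toy example, lumping every contig into a single cluster lowers precision, because unrelated contigs are grouped together, and also lowers recall slightly, because the shared contig `d` is assigned to only one cluster although it belongs to two genomes. A hard clustering cannot avoid that second penalty, which is one reason soft clustering is of interest when strains share sequence.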

RESULTS

When all genomes in a sample are below 95% sequence identity, all of the tested clustering algorithms performed well. When sequence data contains genomes above 95% identity (our operational definition of strain-level diversity), a naive soft-clustering extension of the Louvain method achieves the highest performance.
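
The abstract does not say how the Louvain method was extended to produce soft clusters. Purely as an illustration of what a naive extension could look like, the sketch below takes an existing hard partition (for example, one produced by any Louvain implementation) and additionally assigns each node to the clusters of its neighbours across cluster boundaries. The rule itself, the function name `soften_partition`, and the `min_weight` threshold are hypothetical and should not be read as the authors' method.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (node_u, node_v, contact weight)


def soften_partition(
    hard: Dict[Hashable, int],
    edges: Iterable[Edge],
    min_weight: float = 1.0,
) -> Dict[Hashable, Set[int]]:
    """Turn a hard partition into overlapping clusters (hypothetical rule).

    Every node keeps its hard cluster. If an edge of weight >= `min_weight`
    joins two nodes in different clusters, each node also joins the other's
    cluster.
    """
    soft = {node: {cluster_id} for node, cluster_id in hard.items()}
    for u, v, weight in edges:
        if weight < min_weight:
            continue  # ignore weak links, e.g. likely spurious contacts
        if hard[u] != hard[v]:
            soft[u].add(hard[v])
            soft[v].add(hard[u])
    return soft


# Toy contact graph: contigs "b" and "c" are linked across the two clusters.
hard_partition = {"a": 0, "b": 0, "c": 1, "d": 1}
contacts = [("a", "b", 5.0), ("c", "d", 4.0), ("b", "c", 2.0)]
print(soften_partition(hard_partition, contacts))
```

Raising `min_weight` makes the rule more conservative: in the toy graph, a threshold of 3.0 would leave the hard partition unchanged because the only cross-cluster link has weight 2.0.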

DISCUSSION

Previously, only hard-clustering algorithms have been applied to metagenomic 3C and Hi-C data, yet none of these perform well when strain-level diversity exists in a metagenomic sample. Our simple extension of the Louvain method performed best in these scenarios; however, accuracy remained well below the levels observed for samples without strain-level diversity. Strain resolution is also highly dependent on the amount of available 3C sequence data, suggesting that depth of sequencing must be carefully considered during experimental design. Finally, there appears to be great scope to improve the accuracy of strain resolution through further algorithm development.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b8c8/5103821/59594f1d82e0/peerj-04-2676-g001.jpg