Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic.

Affiliations

Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK.

Wellcome Trust-Africa Centre for Health and Population Studies, University of KwaZulu-Natal, Durban, South Africa.

Publication information

Sci Rep. 2016 Dec 23;6:39489. doi: 10.1038/srep39489.

Abstract

HIV molecular epidemiology studies analyse viral pol gene sequences because of their availability, but whole-genome sequencing makes it possible to use other genes. We aimed to determine which gene or genes provide the best approximation to the real phylogeny by analysing a simulated epidemic (created as part of the PANGEA_HIV project) with a known transmission tree. We sub-sampled a simulated dataset of 4662 sequences into different combinations of genes (gag-pol-env, gag-pol, gag, pol, env and partial pol) and sampling depths (100%, 60%, 20% and 5%), generating 100 replicates for each case. We built maximum-likelihood trees for each combination using RAxML (GTR + Γ) and compared their topologies to that of the corresponding true tree using CompareTree. The accuracy of the trees increased significantly with the length of the sequences used, with the gag-pol-env datasets showing the best performance and the gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets. In conclusion, using longer sequences derived from nearly whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences.
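The sub-sampling design described in the abstract (six gene sets, four sampling depths, 100 replicates each) can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the sequence IDs, the function names `subsample_ids` and `iter_design`, the fixed seed, and the choice to draw each replicate independently for every gene set are all assumptions. The abstract does not say how the replicates were drawn, and the actual extraction of gene regions from the alignment is not shown.

```python
import random
from itertools import product

# Values taken from the abstract.
GENE_SETS = ["gag-pol-env", "gag-pol", "gag", "pol", "env", "partial pol"]
DEPTHS = [1.00, 0.60, 0.20, 0.05]
N_SEQUENCES = 4662
N_REPLICATES = 100


def subsample_ids(ids, depth, rng):
    """Draw round(depth * len(ids)) sequence IDs without replacement."""
    k = round(depth * len(ids))
    return sorted(rng.sample(ids, k))


def iter_design(ids, seed=0):
    """Yield (gene_set, depth, replicate, sampled_ids) for every combination.

    Assumption: each replicate is an independent random draw; the seed is
    fixed only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    for gene_set, depth in product(GENE_SETS, DEPTHS):
        for replicate in range(N_REPLICATES):
            yield gene_set, depth, replicate, subsample_ids(ids, depth, rng)


# Hypothetical sequence identifiers standing in for the simulated dataset.
all_ids = [f"seq{i:04d}" for i in range(N_SEQUENCES)]

# Inspect only the first run to keep the example light.
gene_set, depth, replicate, sampled = next(iter_design(all_ids))
print(gene_set, depth, replicate, len(sampled))
```

Each yielded tuple would then correspond to one alignment passed to the tree-building step and one comparison against the true tree pruned to the same sampled sequences.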
