Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus.

Author information

Catanach Therese A, Sweet Andrew D, Nguyen Nam-Phuong D, Peery Rhiannon M, Debevec Andrew H, Thomer Andrea K, Owings Amanda C, Boyd Bret M, Katz Aron D, Soto-Adames Felipe N, Allen Julie M

Affiliations

Ornithology Department, Academy of Natural Sciences of Drexel University, Philadelphia, PA, United States of America.

Illinois Natural History Survey, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America.

Publication information

PeerJ. 2019 Jan 3;7:e6142. doi: 10.7717/peerj.6142. eCollection 2019.

Abstract

Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important but increasingly computationally expensive step, given the recent surge in DNA sequence data. Much of this sequence data is publicly available but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected "by eye" prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear whether these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combine automated and manual methods. Here we used approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set composed exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach followed by manual editing by eye, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we used multiple metrics to compare the tree topologies resulting from these different alignment methods. We further determined whether the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies.
Although results varied between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could be best explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are fully compatible with older methods, even in extremely difficult-to-align data sets. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily sound groups, regardless of alignment type and support threshold. This suggests that there may be errors in genotype classification in the database or that HBV genotypes may need revision.
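The abstract does not name the topology metrics or software used, so the following is only an illustrative sketch of the two kinds of comparison it describes: measuring how much two trees differ, and checking whether a labelled group forms a clade. It assumes a rooted Robinson-Foulds distance as the metric and represents trees as nested tuples; the taxon labels, tree shapes, and function names are all hypothetical, and branch support thresholds are not modelled.

```python
from typing import FrozenSet, Iterable, Set, Tuple, Union

# A tree is either a leaf label or a tuple of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", ...]]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the leaf set under every internal node of a rooted tree."""
    found: Set[FrozenSet[str]] = set()

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips

    walk(tree)
    return found


def rf_distance(a: Tree, b: Tree) -> int:
    """Rooted Robinson-Foulds distance: clades present in exactly one tree."""
    return len(clades(a) ^ clades(b))


def is_monophyletic(tree: Tree, group: Iterable[str]) -> bool:
    """True if some internal node contains exactly the taxa in `group`."""
    members = frozenset(group)
    if len(members) == 1:
        # A single tip is trivially monophyletic if it is in the tree.
        return any(members <= clade for clade in clades(tree))
    return members in clades(tree)


# Hypothetical trees from two alignment strategies; A*, B*, C* stand in
# for sequences assigned to three genotypes.
tree_manual: Tree = ((("A1", "A2"), ("B1", "B2")), ("C1", "C2"))
tree_auto: Tree = ((("A1", "B1"), ("A2", "B2")), ("C1", "C2"))

print(rf_distance(tree_manual, tree_auto))
for name, taxa in {"A": {"A1", "A2"}, "C": {"C1", "C2"}}.items():
    print(name, is_monophyletic(tree_manual, taxa), is_monophyletic(tree_auto, taxa))
```

In this toy case genotype C forms a clade in both trees, while genotype A does so only in the first. This is the kind of per-genotype outcome the abstract reports across alignment types.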

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bbd8/6321758/20181d4bafd2/peerj-07-6142-g001.jpg
