Shepard Samuel S, Davis C Todd, Bahl Justin, Rivailler Pierre, York Ian A, Donis Ruben O
Influenza Division, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America.
Laboratory of Virus Evolution in Program of Emerging Infectious Diseases, Duke-NUS Graduate Medical School, Singapore, Singapore; Center for Infectious Diseases, The University of Texas School of Public Health, Houston, Texas, United States of America.
PLoS One. 2014 Jan 23;9(1):e86921. doi: 10.1371/journal.pone.0086921. eCollection 2014.
The evolutionary classification of influenza genes into lineages is a first step in understanding their molecular epidemiology and can inform the subsequent implementation of control measures. We introduce a novel approach called Lineage Assignment By Extended Learning (LABEL) to rapidly determine cladistic information for any number of genes without the need for time-consuming sequence alignment, phylogenetic tree construction, or manual annotation. Instead, LABEL relies on hidden Markov model profiles and support vector machine training to hierarchically classify gene sequences by their similarity to pre-defined lineages. We assessed LABEL by analyzing the annotated hemagglutinin genes of highly pathogenic (H5N1) and low pathogenicity (H9N2) avian influenza A viruses. Using the WHO/FAO/OIE H5N1 evolution working group nomenclature, the LABEL pipeline quickly and accurately identified the H5 lineages of uncharacterized sequences. Moreover, we developed an updated clade nomenclature for the H9 hemagglutinin gene and show a similarly fast and reliable phylogenetic assessment with LABEL. While this study was focused on hemagglutinin sequences, LABEL could be applied to the analysis of any gene and shows great potential to guide molecular epidemiology activities, accelerate database annotation, and provide a data sorting tool for other large-scale bioinformatic studies.
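The hierarchical step described above, in which a sequence is assigned to the most similar pre-defined lineage at each level and then refined within that lineage, can be illustrated with a minimal sketch. It is not the LABEL implementation: a simple k-mer Jaccard similarity stands in for the hidden Markov model profile scores and the support vector machine decision, and the lineage names (`clade_A`, `A1`, ...), the toy sequences, and the helpers `kmers`, `jaccard`, `LineageNode`, and `classify` are all hypothetical.

```python
from dataclasses import dataclass, field


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class LineageNode:
    """A node in a hypothetical lineage hierarchy."""
    name: str
    references: list = field(default_factory=list)  # reference sequences
    children: list = field(default_factory=list)    # sub-lineages

    def profile(self, k=3):
        """k-mers of this node's references and of all its descendants."""
        prof = set()
        for ref in self.references:
            prof |= kmers(ref, k)
        for child in self.children:
            prof |= child.profile(k)
        return prof


def classify(root, query, k=3):
    """Descend the hierarchy, choosing the best-scoring child at each level.

    The similarity score is a stand-in (assumption) for the HMM/SVM
    scoring that the abstract describes.
    """
    q = kmers(query, k)
    path = []
    node = root
    while node.children:
        node = max(node.children, key=lambda c: jaccard(q, c.profile(k)))
        path.append(node.name)
    return path


# Toy hierarchy with made-up sequences; these are not real influenza clades.
root = LineageNode("root", children=[
    LineageNode("clade_A", children=[
        LineageNode("A1", references=["AAAAAAAACC"]),
        LineageNode("A2", references=["AAAAAAAAGG"]),
    ]),
    LineageNode("clade_B", children=[
        LineageNode("B1", references=["TTTTTTTTCC"]),
        LineageNode("B2", references=["TTTTTTTTGG"]),
    ]),
])

assignment = classify(root, "AAAAAAAAGG")
```

Here `assignment` holds the path from the top-level lineage down to the most specific one. Like the approach in the abstract, the sketch needs no multiple sequence alignment or tree construction at classification time, only a score of the query against each candidate lineage.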