Lourenço José, McNaughton Anna L, Pley Caitlin, Obolski Uri, Gupta Sunetra, Matthews Philippa C
BioISI (Biosystems and Integrative Sciences Institute), Faculty of Sciences, University of Lisbon, Campo Grande, Lisbon 1749-016, Portugal.
Population Health Science, Bristol Medical School, University of Bristol, 5 Tyndall Ave, Bristol BS8 1UD, UK.
Virus Evol. 2022 Dec 10;9(1):veac116. doi: 10.1093/ve/veac116. eCollection 2023.
Hepatitis B viruses (HBVs) are compact viruses with circular genomes of ∼3.2 kb in length. Four genes (HBx, Core, Surface, and Polymerase) generating seven products are encoded on overlapping reading frames. Ten HBV genotypes have been characterised (A-J), which may account for differences in transmission, outcomes of infection, and treatment response. However, HBV genotyping is rarely undertaken, and sequencing remains inaccessible in many settings. We set out to assess which amino acid (aa) sites in the HBV genome are most informative for determining genotype, using a machine learning approach based on random forest algorithms (RFA). We downloaded 5,496 genome-length HBV sequences from a public database, excluding recombinant sequences, regions with conserved indels, and genotypes I and J. Each gene was separately translated into aa, and the proteins concatenated into a single sequence (length 1,614 aa). Using RFA, we searched for aa sites predictive of genotype and assessed covariation among the sites with a mutual information-based method. We were able to discriminate confidently between genotypes A-H using ten aa sites. Half of these sites (5/10) were identified in Polymerase (Pol), of which 4/5 were in the spacer domain and one in reverse transcriptase. A further 4/10 sites were located in Surface protein and a single site in HBx. There were no informative sites in Core. Properties of the aa were generally not conserved between genotypes at informative sites. Among the highest co-varying pairs of sites, there were fifty-five pairs that included one of these 'top ten' sites. Overall, we have shown that RFA analysis is a powerful tool for identifying aa sites that predict the HBV lineage, with an unexpectedly high number of such sites in the spacer domain, which has conventionally been viewed as unimportant for structure or function. Our results improve ease of genotype prediction from limited regions of HBV sequences and may have future applications in understanding HBV evolution.
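The covariation step described above rests on mutual information between alignment columns. The following is a minimal sketch of that quantity only, not the authors' pipeline: it uses plain (uncorrected) mutual information rather than whatever variant the study applied, it does not implement the random forest ranking, and the sequences, genotype labels, and function names (`mutual_information`, `column`) are invented for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plain mutual information (in bits) between two equal-length symbol lists."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def column(alignment, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]


# Toy data, invented for illustration (not real HBV sequences or genotype calls).
toy_alignment = ["MKA", "MKA", "MRT", "MRT"]
toy_genotypes = ["A", "A", "D", "D"]

# How much each site tells us about the genotype label (a simple proxy, not RFA importance).
site_scores = [
    mutual_information(column(toy_alignment, i), toy_genotypes)
    for i in range(len(toy_alignment[0]))
]

# Covariation between sites 1 and 2 (0-based) of the toy alignment.
pair_score = mutual_information(column(toy_alignment, 1), column(toy_alignment, 2))

print(site_scores, pair_score)
```

In this toy alignment the first column is invariant and so carries no information about genotype, while the second and third columns each split the sequences exactly along the genotype labels and also co-vary perfectly with each other. On real data, a feature-importance ranking from a random forest classifier would take the place of the per-site scores shown here.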