Department of Biological Sciences, Auburn University, Auburn, AL, USA.
Whitman Center, Marine Biological Laboratory, Woods Hole, MA, USA.
Mol Biol Evol. 2021 Dec 9;38(12):5806-5818. doi: 10.1093/molbev/msab258.
Sequence annotation is fundamental for studying the evolution of protein families, particularly when working with nonmodel species. Given the rapidly increasing number of species with high-quality genome sequences, accurate domain modeling that is representative of species diversity is crucial for understanding the sequence evolution of protein families and their inferred function(s). Here, we describe a bioinformatic tool called Taxon-Informed Adjustment of Markov Model Attributes (TIAMMAt), which revises domain profile hidden Markov models (HMMs) by incorporating homologous domain sequences from underrepresented and nonmodel species. Using innate immunity pathways as a case study, we show that revising profile HMM parameters to directly account for variation in homologs among underrepresented species provides valuable insight into the evolution of protein families. Following adjustment by TIAMMAt, domain profile HMMs exhibit changes in their per-site amino acid state emission probabilities and insertion/deletion probabilities while maintaining the overall structure of the consensus sequence. Our results show that domain revision can heavily impact evolutionary interpretations for some families (i.e., NLR's NACHT domain), whereas the impact on other domains (e.g., rel homology domain and interferon regulatory factor domains) is minimal due to high levels of sequence conservation across the sampled phylogenetic depth (i.e., Metazoa). Importantly, TIAMMAt revises target domain models to reflect homologous sequence variation using the taxonomic distribution under consideration by the user. TIAMMAt's flexibility to revise any subset of the Pfam database using a user-defined taxonomic pool will make it a valuable tool for future protein evolution studies, particularly when incorporating (or focusing on) nonmodel species.
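To make the claim about per-site emission probabilities concrete, the following is a minimal sketch of how adding homologs from further taxa can shift column-wise amino acid probabilities while leaving the consensus unchanged. It is not TIAMMAt's implementation: the function names, the toy four-residue alignment, and the flat add-one pseudocount are all assumptions chosen for illustration, and real profile HMM builders use more elaborate priors and sequence weighting.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_emissions(alignment, pseudocount=1.0):
    """Estimate per-column amino acid probabilities from equal-length sequences.

    Assumption: a flat pseudocount is added to every residue in every column.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        # Count only standard residues; gaps or ambiguity codes are ignored.
        counts = Counter(seq[i] for seq in alignment if seq[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def consensus(profile):
    """Return the most probable residue at each column."""
    return "".join(max(col, key=col.get) for col in profile)


# Hypothetical toy data: a seed alignment and homologs from additional taxa.
seed = ["GKST", "GKST", "GKSS"]
added_homologs = ["GKTT", "GRTT"]

before = column_emissions(seed)
after = column_emissions(seed + added_homologs)

print("consensus before:", consensus(before))
print("consensus after: ", consensus(after))
print("P(T) at column 3 before: %.3f" % before[2]["T"])
print("P(T) at column 3 after:  %.3f" % after[2]["T"])
```

In this toy case the consensus string is identical before and after, but the probability of threonine at the third column rises once the additional sequences are counted, which is the kind of parameter shift the abstract describes.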