HIPPI: highly accurate protein family classification with ensembles of HMMs.

Authors

Nguyen Nam-Phuong, Nute Michael, Mirarab Siavash, Warnow Tandy

Affiliations

Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, 92093, CA, USA.

Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright Street, Urbana, 61820, IL, USA.

Publication information

BMC Genomics. 2016 Nov 11;17(Suppl 10):765. doi: 10.1186/s12864-016-3097-0.

DOI: 10.1186/s12864-016-3097-0
PMID: 28185571
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5123343/

Abstract

BACKGROUND

Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of sequences that are distantly related to sequences in public databases or that are fragmentary remains one of the more difficult analytical problems in bioinformatics.

RESULTS

We present a new technique for family identification called HIPPI (Hierarchical Profile Hidden Markov Models for Protein family Identification). HIPPI uses a novel technique to represent a multiple sequence alignment for a given protein family or superfamily by an ensemble of profile hidden Markov models computed using HMMER. An evaluation of HIPPI on the Pfam database shows that HIPPI has better overall precision and recall than blastp, HMMER, and pipelines based on HHsearch, and maintains good accuracy even for fragmentary query sequences and for protein families with low average pairwise sequence identity, both conditions where other methods degrade in accuracy.
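
The classification step described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes each family is represented by the best bit score among the profile HMMs in its ensemble, that the query is assigned to the family with the highest such score, and that scores have already been computed (for example, parsed from HMMER output). The names `classify`, `precision_recall`, `min_score`, and the family labels are hypothetical, and the precision and recall definitions are one simple convention rather than the paper's exact evaluation protocol.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical input: family name -> bit scores of one query against every
# profile HMM in that family's ensemble (assumed to be precomputed).
EnsembleScores = Dict[str, List[float]]


def classify(scores: EnsembleScores, min_score: float = 0.0) -> Optional[str]:
    """Assign the query to the family whose best ensemble member scores highest.

    Assumption: a family's score is the maximum over its HMMs, and a query
    whose best score does not exceed `min_score` is left unclassified (None).
    """
    best_family: Optional[str] = None
    best_score = min_score
    for family, member_scores in scores.items():
        if not member_scores:
            continue  # skip families with an empty ensemble
        top = max(member_scores)
        if top > best_score:
            best_family, best_score = family, top
    return best_family


def precision_recall(
    predicted: Sequence[Optional[str]], truth: Sequence[str]
) -> Tuple[float, float]:
    """Precision over classified queries; recall over all queries."""
    classified = [(p, t) for p, t in zip(predicted, truth) if p is not None]
    correct = sum(1 for p, t in classified if p == t)
    precision = correct / len(classified) if classified else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Toy scores for three queries against two made-up families.
queries = [
    {"family_A": [12.0, 35.5, 20.1], "family_B": [30.2, 8.4]},
    {"family_A": [5.0], "family_B": [18.7, 22.3]},
    {"family_A": [-3.0], "family_B": [-1.5]},
]
truth = ["family_A", "family_A", "family_B"]
predicted = [classify(q) for q in queries]
precision, recall = precision_recall(predicted, truth)
```

In this toy run the first query goes to `family_A` because one member of its ensemble scores 35.5, even though the other members score below `family_B`'s best HMM. That is the intuition behind representing a diverse family with several HMMs rather than one.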

CONCLUSION

HIPPI provides accurate protein family identification and is robust to difficult model conditions. Our results, combined with observations from previous studies, show that ensembles of profile Hidden Markov models can better represent multiple sequence alignments than a single profile Hidden Markov model, and thus can improve downstream analyses for various bioinformatic tasks. Further research is needed to determine the best practices for building the ensemble of profile Hidden Markov models. HIPPI is available on GitHub at https://github.com/smirarab/sepp.

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3817/5123343/987920b66371/12864_2016_3097_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3817/5123343/4a4c63fcfa03/12864_2016_3097_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3817/5123343/91ace9ebb340/12864_2016_3097_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3817/5123343/39080e5f4858/12864_2016_3097_Fig4_HTML.jpg
