Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Karnataka, India.
Department of Biotechnology, Faculty of Life and Allied Health Sciences, M.S. Ramaiah University of Applied Sciences, Bangalore, Karnataka, India.
Methods Mol Biol. 2022;2449:149-167. doi: 10.1007/978-1-0716-2095-3_5.
Sequence-based approaches are fundamental in guiding experimental investigations toward structural and/or functional insights into uncharacterized protein families. Powerful profile-based sequence search methods rely on a continuum in sequence space to identify non-trivial relationships through homology detection. Computationally designed protein-like sequences that serve as "artificial linkers" are useful in identifying relationships between distant members of a structural fold. Such sequences act as intermediates and guide homology searches between distantly related proteins. Here, we describe an approach that represents natural intermediate sequences and designed protein-like sequences as profile HMMs (Hidden Markov Models) to improve the sensitivity of existing search methods. Searches made within the "Profile database" recognized the parent structural fold for 90% of the search queries at a query coverage better than 60%. For 1040 protein families with no available structure, fold associations were made through searches in the database of natural and designed sequence profiles. Most of the associations were made with the Alpha-alpha superhelix, Transmembrane beta-barrels, TIM barrel, and Immunoglobulin-like beta-sandwich folds. For 11 DUFs (Domain families of Unknown Functions), we provide confident fold associations using the profiles of designed sequences and a consensus from other fold recognition methods. For two of these DUFs, we performed detailed functional annotation through comparisons with characterized templates from families of known function.
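The query-coverage criterion mentioned above can be illustrated with a minimal sketch. It assumes coverage is the fraction of query residues spanned by the aligned region of a hit; the `Hit` record, the function names, and the E-value cutoff are hypothetical and used only for illustration, and only the 60% coverage threshold comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one profile match (not a real tool's output format)."""
    fold: str          # structural fold associated with the matched profile
    query_start: int   # first aligned query residue, 1-based inclusive
    query_end: int     # last aligned query residue, 1-based inclusive
    evalue: float      # E-value reported for the match


def query_coverage(hit: Hit, query_length: int) -> float:
    """Fraction of the query sequence spanned by the aligned region."""
    return (hit.query_end - hit.query_start + 1) / query_length


def confident_hits(hits, query_length, min_coverage=0.6, max_evalue=1e-3):
    """Keep hits whose coverage exceeds min_coverage, sorted by E-value.

    min_coverage=0.6 mirrors the 60% figure in the abstract;
    max_evalue is an assumed cutoff chosen for this example.
    """
    kept = [
        h for h in hits
        if h.evalue <= max_evalue and query_coverage(h, query_length) > min_coverage
    ]
    return sorted(kept, key=lambda h: h.evalue)


if __name__ == "__main__":
    # Made-up hits for a 200-residue query.
    hits = [
        Hit("TIM barrel", 11, 190, 1e-10),
        Hit("Immunoglobulin-like beta-sandwich", 1, 50, 1e-5),
        Hit("Alpha-alpha superhelix", 1, 200, 0.5),
    ]
    for h in confident_hits(hits, query_length=200):
        print(h.fold, round(query_coverage(h, 200), 2))
```

In this toy input only the first hit passes both filters: the second covers a quarter of the query, and the third fails the assumed E-value cutoff.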