Mozaffari Soroush, Arrías Paula Nazarena, Clementel Damiano, Piovesan Damiano, Ferrari Carlo, Tosatto Silvio C E, Monzon Alexander Miguel
Department of Biomedical Sciences, University of Padova, Padova 35121, Italy.
Department of Protein Science, KTH Royal Institute of Technology, Stockholm SE-10691, Sweden.
Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae690.
Structured tandem repeat proteins (STRPs) constitute a subclass of tandem repeats characterized by repetitive structural motifs. These proteins exhibit distinct secondary structures that form repetitive tertiary arrangements, often resulting in large molecular assemblies. Despite highly variable sequences, STRPs can perform important and diverse biological functions, maintaining a consistent structure with a variable number of repeat units. With the advent of protein structure prediction methods, millions of 3D models of proteins are now publicly available. However, automatic detection of STRPs remains challenging because current state-of-the-art tools lack accuracy and have long execution times, which hinders their application to large datasets. In most cases, manual curation remains the most accurate method for detecting and classifying STRPs, which makes annotating millions of structures impractical.
We introduce STRPsearch, a novel tool for the rapid identification, classification, and mapping of STRPs. Using manually curated entries from RepeatsDB as the known conformational space of STRPs, STRPsearch applies the latest advances in structural alignment for fast and accurate detection of repeated structural motifs in proteins, and then maps units and insertions with an innovative approach based on the generation of TM-score profiles. STRPsearch is highly scalable, efficiently processing large datasets, and can be applied to both experimental structures and predicted models. In addition, it outperforms existing tools, offering researchers a reliable and comprehensive solution for STRP analysis across diverse proteomes.
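The abstract does not specify how TM-score profiles are built or how units are read from them, so the following is only an illustrative sketch, not the STRPsearch implementation. It assumes a profile that holds one TM-score per window start position along the target protein, obtained by aligning a reference repeat unit against each window, and treats sufficiently high, well-separated local maxima as unit starts. The function names, the threshold, the fixed unit length, and the toy profile values are all hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # inclusive (start, end) residue indices


def find_profile_peaks(profile: List[float], min_height: float,
                       min_distance: int) -> List[int]:
    """Return local maxima of the profile that reach min_height and are
    at least min_distance positions apart, preferring higher peaks."""
    last = len(profile) - 1
    candidates = [
        i for i, score in enumerate(profile)
        if score >= min_height
        and (i == 0 or score > profile[i - 1])
        and (i == last or score >= profile[i + 1])
    ]
    selected: List[int] = []
    # Greedy selection: visit the highest peaks first and drop close neighbours.
    for i in sorted(candidates, key=lambda idx: profile[idx], reverse=True):
        if all(abs(i - j) >= min_distance for j in selected):
            selected.append(i)
    return sorted(selected)


def map_units(peaks: List[int],
              unit_length: int) -> Tuple[List[Region], List[Region]]:
    """Turn peak positions into unit regions and report the gaps between
    consecutive units as insertions (assumes a fixed unit length)."""
    units = [(p, p + unit_length - 1) for p in peaks]
    insertions = [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(units, units[1:])
        if next_start > prev_end + 1
    ]
    return units, insertions


# Toy profile (hypothetical values): one TM-score per window start position.
profile = [0.2, 0.6, 0.3, 0.2, 0.1, 0.2, 0.7, 0.3, 0.2,
           0.2, 0.1, 0.2, 0.2, 0.2, 0.65, 0.3, 0.2]
peaks = find_profile_peaks(profile, min_height=0.5, min_distance=5)
units, insertions = map_units(peaks, unit_length=5)
print(peaks)       # [1, 6, 14]
print(units)       # [(1, 5), (6, 10), (14, 18)]
print(insertions)  # [(11, 13)]
```

In this toy case the first two units are contiguous, while the three residues between the second and third units are reported as an insertion. A real pipeline would also have to handle variable unit lengths and noisy profiles, which this sketch ignores.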
STRPsearch is coded in Python. All scripts and associated documentation are available from: https://github.com/BioComputingUP/STRPsearch.