Terrence Donnelly Centre for Cellular and Biomolecular Research, Kim Lab, University of Toronto, Toronto, ON M5S 3E1, Canada, CUBE-Division of Computational Systems Biology, Department of Microbiology and Ecosystem Science, University of Vienna, 1090 Vienna, Austria and Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85764 Neuherberg, Germany.
Nucleic Acids Res. 2014 Jan;42(Database issue):D279-84. doi: 10.1093/nar/gkt970. Epub 2013 Oct 27.
The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database is designed to greatly accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities that interconnect the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith-Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow SIMAP to be connected with other bioinformatics tools and resources. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to keep pace with the rapidly growing universe of sequenced proteins. New Web-Service interfaces enhance the connectivity of SIMAP, and novel tools for interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal, which also contains instructions and links for software access and flat file downloads.
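To make the idea of an all-against-all similarity matrix concrete, the following is a minimal sketch, not SIMAP's actual pipeline. It scores every pair in a small collection with a plain Smith-Waterman local alignment. The function names (`smith_waterman_score`, `similarity_matrix`), the toy sequences, and the scoring scheme (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration; a production setup would normally use a substitution matrix such as BLOSUM with affine gap penalties, plus a heuristic prefilter such as FASTA.

```python
from itertools import combinations


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Illustrative scoring only: identity match/mismatch with a linear
    gap penalty (an assumption, not SIMAP's parameters).
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity_matrix(seqs):
    """Compute all-against-all scores for a dict of name -> sequence.

    Each unordered pair is aligned once and stored in both orientations,
    since the score is symmetric under this scoring scheme.
    """
    scores = {}
    for name, seq in seqs.items():
        scores[(name, name)] = smith_waterman_score(seq, seq)
    for x, y in combinations(seqs, 2):
        s = smith_waterman_score(seqs[x], seqs[y])
        scores[(x, y)] = s
        scores[(y, x)] = s
    return scores


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    toy = {"p1": "MKTAYIAK", "p2": "TAYI", "p3": "GGGG"}
    for pair, score in sorted(similarity_matrix(toy).items()):
        print(pair, score)
```

The number of pairwise alignments grows quadratically with the number of sequences, which is why pre-calculating and storing the matrix, as SIMAP does, pays off for collections of this scale.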