Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, 660 South Euclid Avenue, Saint Louis, MO 63110, United States.
Center for Biomolecular Condensates, Washington University in St. Louis, 1 Brookings Drive, Saint Louis, MO 63130, United States.
Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad488.
The emergence of high-throughput experiments and high-resolution computational predictions has led to an explosion in the quality and volume of protein sequence annotations at proteomic scales. Unfortunately, sanity checking, integrating, and analyzing complex sequence annotations remain logistically challenging and present a major barrier to entry for even superficial integrative bioinformatics.
To address this technical burden, we have developed SHEPHARD, a Python framework that trivializes large-scale integrative protein bioinformatics. SHEPHARD combines an object-oriented hierarchical data structure with database-like features, enabling programmatic annotation, integration, and analysis of complex datatypes. Importantly, SHEPHARD is easy to use and enables a Pythonic interrogation of large-scale protein datasets with millions of unique annotations. We use SHEPHARD to examine three orthogonal proteome-wide questions relating protein sequence to molecular function, illustrating its ability to uncover novel biology.
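As a rough illustration of what an object-oriented hierarchical data structure with database-like lookups can look like, the sketch below builds a toy proteome in plain Python. It is not SHEPHARD's API: the class names (`Proteome`, `Protein`, `Domain`, `Site`), the method names, the identifier `P1`, the sequence, and the annotation labels are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    position: int      # 1-indexed residue position (assumed convention)
    site_type: str     # hypothetical label, e.g. "phosphosite"


@dataclass
class Domain:
    start: int         # 1-indexed, inclusive
    end: int           # 1-indexed, inclusive
    domain_type: str   # hypothetical label, e.g. "IDR"

    def contains(self, position: int) -> bool:
        return self.start <= position <= self.end


@dataclass
class Protein:
    unique_id: str
    sequence: str
    domains: list = field(default_factory=list)
    sites: list = field(default_factory=list)

    def add_domain(self, start: int, end: int, domain_type: str) -> None:
        # Sanity check: the annotation must fall inside the sequence.
        if not (1 <= start <= end <= len(self.sequence)):
            raise ValueError(f"domain {start}-{end} is outside {self.unique_id}")
        self.domains.append(Domain(start, end, domain_type))

    def add_site(self, position: int, site_type: str) -> None:
        if not (1 <= position <= len(self.sequence)):
            raise ValueError(f"site {position} is outside {self.unique_id}")
        self.sites.append(Site(position, site_type))

    def sites_in_domains(self, site_type: str, domain_type: str) -> list:
        # Integrative query: sites of one type that lie inside domains of one type.
        return [
            s for s in self.sites
            if s.site_type == site_type
            and any(d.domain_type == domain_type and d.contains(s.position)
                    for d in self.domains)
        ]


@dataclass
class Proteome:
    proteins: dict = field(default_factory=dict)

    def add_protein(self, unique_id: str, sequence: str) -> Protein:
        # Database-like behavior: identifiers must be unique.
        if unique_id in self.proteins:
            raise ValueError(f"duplicate protein ID: {unique_id}")
        protein = Protein(unique_id, sequence)
        self.proteins[unique_id] = protein
        return protein

    def __iter__(self):
        return iter(self.proteins.values())


# Toy data, invented for illustration only.
proteome = Proteome()
p = proteome.add_protein("P1", "MSTSPEEKRRSSPDLLAQ")
p.add_domain(1, 10, "IDR")
p.add_site(3, "phosphosite")
p.add_site(14, "phosphosite")

# Positions of phosphosites that fall inside an IDR, per protein.
hits = {
    prot.unique_id: [s.position for s in prot.sites_in_domains("phosphosite", "IDR")]
    for prot in proteome
}
```

In this sketch, annotations are validated against the sequence when they are added, and questions that span annotation types reduce to ordinary iteration over proteins. Those are the two properties the abstract attributes to a hierarchical, database-like design.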
We provide SHEPHARD both as a stand-alone software package (https://github.com/holehouse-lab/shephard) and as a Google Colab notebook with a collection of precomputed proteome-wide annotations (https://github.com/holehouse-lab/shephard-colab).