SHEPHARD: a modular and extensible software architecture for analyzing and annotating large protein datasets.

Affiliations

Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, 660 South Euclid Avenue, Saint Louis, MO 63110, United States.

Center for Biomolecular Condensates, Washington University in St. Louis, 1 Brookings Drive, Saint Louis, MO 63130, United States.

Publication information

Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad488.

DOI:10.1093/bioinformatics/btad488

PMID:37540173

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10423030/

Abstract

MOTIVATION

The emergence of high-throughput experiments and high-resolution computational predictions has led to an explosion in the quality and volume of protein sequence annotations at proteomic scales. Unfortunately, sanity checking, integrating, and analyzing complex sequence annotations remain logistically challenging and introduce a major barrier to entry for even superficial integrative bioinformatics.

RESULTS

To address this technical burden, we have developed SHEPHARD, a Python framework that trivializes large-scale integrative protein bioinformatics. SHEPHARD combines an object-oriented hierarchical data structure with database-like features, enabling programmatic annotation, integration, and analysis of complex datatypes. Importantly, SHEPHARD is easy to use and enables a Pythonic interrogation of large-scale protein datasets with millions of unique annotations. We use SHEPHARD to examine three orthogonal proteome-wide questions relating protein sequence to molecular function, illustrating its ability to uncover novel biology.
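As a rough illustration of what an object-oriented hierarchical data structure with database-like lookups can look like, the sketch below builds a toy proteome that holds proteins, which in turn hold domain and site annotations. It is not SHEPHARD's API: the class names (`Proteome`, `Protein`, `Domain`, `Site`), their methods, and the example sequence are all hypothetical and chosen only for this illustration. For the real interface, see the SHEPHARD repository.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Site:
    position: int       # 1-indexed residue position (hypothetical convention)
    site_type: str


@dataclass
class Domain:
    start: int          # 1-indexed, inclusive
    end: int            # 1-indexed, inclusive
    domain_type: str


@dataclass
class Protein:
    unique_id: str
    sequence: str
    domains: List[Domain] = field(default_factory=list)
    sites: List[Site] = field(default_factory=list)

    def add_domain(self, start: int, end: int, domain_type: str) -> Domain:
        # Sanity check: the annotation must fall inside the sequence.
        if not (1 <= start <= end <= len(self.sequence)):
            raise ValueError(f"domain {start}-{end} outside {self.unique_id}")
        domain = Domain(start, end, domain_type)
        self.domains.append(domain)
        return domain

    def add_site(self, position: int, site_type: str) -> Site:
        if not (1 <= position <= len(self.sequence)):
            raise ValueError(f"site {position} outside {self.unique_id}")
        site = Site(position, site_type)
        self.sites.append(site)
        return site

    def domain_sequence(self, domain: Domain) -> str:
        # Convert 1-indexed inclusive coordinates to a Python slice.
        return self.sequence[domain.start - 1:domain.end]

    def sites_in(self, domain: Domain) -> List[Site]:
        # Integration step: which site annotations fall inside this domain?
        return [s for s in self.sites if domain.start <= s.position <= domain.end]


@dataclass
class Proteome:
    proteins: Dict[str, Protein] = field(default_factory=dict)

    def add_protein(self, unique_id: str, sequence: str) -> Protein:
        if unique_id in self.proteins:
            raise ValueError(f"duplicate protein ID {unique_id}")
        protein = Protein(unique_id, sequence)
        self.proteins[unique_id] = protein
        return protein

    def __iter__(self) -> Iterator[Protein]:
        return iter(self.proteins.values())

    def domains_of_type(self, domain_type: str) -> Iterator[Tuple[Protein, Domain]]:
        # Database-like query across every protein in the proteome.
        for protein in self:
            for domain in protein.domains:
                if domain.domain_type == domain_type:
                    yield protein, domain


# Toy usage with a made-up sequence and made-up annotations.
proteome = Proteome()
p1 = proteome.add_protein("P1", "MSTPSRGGSEEDLKK")
p1.add_domain(2, 9, "IDR")
p1.add_site(3, "phosphosite")
p1.add_site(12, "phosphosite")

hits = [
    (prot.unique_id, prot.domain_sequence(dom), len(prot.sites_in(dom)))
    for prot, dom in proteome.domains_of_type("IDR")
]
print(hits)
```

The point of the hierarchy is that each annotation is validated against its parent protein when it is added, so cross-annotation queries (here, counting sites that fall inside a domain) can be written as short loops over the proteome.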

AVAILABILITY AND IMPLEMENTATION

We provide SHEPHARD as both a stand-alone software package (https://github.com/holehouse-lab/shephard) and a Google Colab notebook with a collection of precomputed proteome-wide annotations (https://github.com/holehouse-lab/shephard-colab).

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/887a/10423030/14fa8a5d9f9b/btad488f1.jpg
