Department of Chemistry, Purdue University, West Lafayette, IN, USA.
Bioinformatics. 2019 Oct 15;35(20):4165-4167. doi: 10.1093/bioinformatics/btz178.
The Protein Data Bank (PDB) currently holds over 140 000 biomolecular structures and continues to release new structures on a weekly basis. The PDB is an essential resource for the structural bioinformatics community in developing software that mines, uses, categorizes and analyzes such data. New computational biology methods are evaluated using custom benchmarking sets derived as subsets of experimentally determined 3D structures and structural features from the PDB. Currently, such benchmarking features are manually curated with custom scripts in a non-standardized manner, which slows both their distribution and their updating with new experimental structures. Finally, there is a scarcity of standardized tools to rapidly query 3D descriptors of the entire PDB.
Our solution is the Lemon framework, a C++11 library with Python bindings, which provides a consistent workflow methodology for selecting biomolecular interactions based on user criteria and computing desired 3D structural features. This framework can parse and characterize the entire PDB in <10 min on modern, multithreaded hardware. The parsing speed is obtained by using the recently developed MacroMolecule Transmission Format (MMTF), which avoids the computational cost of reading text-based PDB files. The use of C++ lambda functions and Python bindings provides extensive flexibility for analysis and categorization of the PDB by allowing users to write custom functions to suit their objectives. We think Lemon will become a one-stop shop to quickly mine the entire PDB to generate desired structural biology features.
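The workflow pattern described above (a user-supplied function applied to every entry, run across multiple threads, with per-entry results merged at the end) can be sketched in plain Python. This is an illustrative sketch only and does not use Lemon's actual API: `TOY_ENTRIES`, `count_selected`, `run_workflow` and `STANDARD` are hypothetical names, and the residue lists are made-up stand-ins for parsed structures.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for parsed entries: entry ID -> residue names.
# Real structures would come from a parser; these values are invented.
TOY_ENTRIES = {
    "1ABC": ["ALA", "GLY", "HEM", "HOH"],
    "2XYZ": ["GLY", "GLY", "ATP", "HOH", "HOH"],
    "3DEF": ["SER", "HEM", "HEM"],
}


def count_selected(residues, select):
    """Count the residue names in one entry that satisfy the user predicate."""
    return Counter(name for name in residues if select(name))


def run_workflow(entries, select, threads=4):
    """Apply the per-entry worker to every entry in a thread pool and merge the counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = pool.map(lambda res: count_selected(res, select), entries.values())
        for partial in partials:
            total += partial  # merge per-entry results in the calling thread
    return total


# User criterion, written as a lambda: keep anything outside this toy
# "standard" set (an assumption made purely for illustration).
STANDARD = {"ALA", "GLY", "SER", "HOH"}
ligand_counts = run_workflow(TOY_ENTRIES, lambda name: name not in STANDARD)
print(dict(ligand_counts))
```

The selection criterion is passed in as a function, so changing what is counted does not require touching the parallel driver. This is the kind of flexibility the abstract attributes to lambda functions and Python bindings.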
The Lemon software is available as a C++ header library along with a PyPI package and example functions at https://github.com/chopralab/lemon.
Supplementary data are available at Bioinformatics online.