Markey Cancer Center, University of Kentucky, Lexington, KY, 40536, USA.
Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, KY, 40536, USA.
BMC Bioinformatics. 2023 Mar 4;24(1):78. doi: 10.1186/s12859-023-05208-0.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) provides organized genomic, biomolecular, and metabolic information and knowledge that is reasonably current and highly useful for a wide range of analyses and modeling. KEGG follows the FAIR principles of data stewardship (findable, accessible, interoperable, and reusable) by providing RESTful access to its database entries via the web-accessible KEGG API. However, the overall FAIRness of KEGG is often limited by the library and software package support available in a given programming language. While R library support for KEGG is fairly strong, Python library support has been lacking. Moreover, no existing software provides extensive command-line support for KEGG access and utilization.
We present kegg_pull, a package implemented in the Python programming language that provides more extensive KEGG access and utilization functionality than previous libraries and software packages. In addition to an application programming interface (API) for Python programming, kegg_pull provides a command line interface (CLI) that enables utilization of KEGG for a wide range of shell scripting and data analysis pipeline use cases. As kegg_pull's name implies, both the API and CLI provide versatile options for pulling (downloading and saving) an arbitrary, user-defined number of database entries from the KEGG API. Moreover, this functionality is implemented to efficiently utilize multiple central processing unit cores, as demonstrated in several performance tests. Many options are provided to optimize fault-tolerant performance across a single process or multiple processes, with recommendations based on extensive testing and practical network considerations.
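To make the idea of pulling an arbitrary number of entries concrete, the following is a minimal sketch written against the public KEGG REST API directly, not against kegg_pull itself. It assumes the `https://rest.kegg.jp/get/` endpoint and its limit of 10 entry IDs per request. The helper names (`chunk_entry_ids`, `build_get_url`, `fetch_batch`, `pull_entries`) and the retry parameters are hypothetical, chosen for illustration only. It also uses a thread pool for brevity, whereas the package described here distributes work across multiple processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Assumption: base URL of the KEGG REST "get" operation.
KEGG_GET_URL = "https://rest.kegg.jp/get/"
# Assumption: the "get" operation accepts at most 10 entry IDs per request.
MAX_IDS_PER_REQUEST = 10


def chunk_entry_ids(entry_ids, size=MAX_IDS_PER_REQUEST):
    """Split a list of entry IDs into batches of at most `size` IDs."""
    return [entry_ids[i:i + size] for i in range(0, len(entry_ids), size)]


def build_get_url(batch):
    """Build one "get" URL; multiple entry IDs are joined with '+'."""
    return KEGG_GET_URL + "+".join(batch)


def fetch_batch(batch, n_tries=3, timeout=60, sleep_time=5.0):
    """Request one batch, retrying on network errors; return None on failure."""
    url = build_get_url(batch)
    for attempt in range(n_tries):
        try:
            with urlopen(url, timeout=timeout) as response:
                return response.read().decode("utf-8")
        except OSError:  # covers URLError, HTTPError, and socket timeouts
            if attempt < n_tries - 1:
                time.sleep(sleep_time)
    return None


def pull_entries(entry_ids, n_workers=2):
    """Fetch all batches concurrently.

    Returns (texts, failed_batches) so the caller can retry failures later.
    """
    batches = chunk_entry_ids(entry_ids)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(fetch_batch, batches))
    texts = [text for text in results if text is not None]
    failed = [b for b, text in zip(batches, results) if text is None]
    return texts, failed


if __name__ == "__main__":
    # Example: request two compound entries (this performs a network call).
    texts, failed = pull_entries(["cpd:C00001", "cpd:C00002"])
    print(len(texts), "batch(es) pulled;", len(failed), "batch(es) failed")
```

Returning the failed batches separately, rather than raising on the first error, is one simple way to get the kind of fault tolerance described above: a long pull can finish and the failures can be retried afterwards.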
The kegg_pull package enables flexible KEGG retrieval use cases that are not available in previous software packages. Its most notable new feature is the ability to robustly pull an arbitrary number of KEGG entries with a single API method or CLI command, including pulling an entire KEGG database. We provide recommendations to users for the most effective use of kegg_pull according to their network and computational circumstances.