Embrapa Informática Agropecuária, Campinas, São Paulo, Post Code 13083-886, PO Box 6041, Brazil.
Gigascience. 2020 Sep 14;9(9). doi: 10.1093/gigascience/giaa097.
Genome projects and multiomics experiments generate huge volumes of data that must be stored, mined, and transformed into useful knowledge. All of this information should be accessible and, where possible, browsable afterwards. Computational biologists have faced this scenario for more than a decade and have built software and databases to meet the challenge. Chado, the biological relational database schema of the Generic Model Organism Database (GMOD) project, is one of the few successful open source initiatives; it is widely adopted, and many software packages can connect to it.
We have been developing Machado, an open source genomics data integration framework implemented in Python that enables research groups to both store and visualize genomics data. Because the framework relies on the Chado database schema, developers already familiar with Chado should find it straightforward to adopt or to run on top of existing databases. It provides several data-loading tools for genomics and transcriptomics data, as well as for annotation results from tools such as BLAST, InterProScan, OrthoMCL, and LSTrAP. An API connects to JBrowse, and a web visualization tool is implemented with Django views and templates. The Haystack library, integrated with the Elasticsearch engine, provides a Google-like search: a single auto-complete search box that returns fast results and filters.
Machado aims to be a modern object-relational framework that uses the latest Python libraries to produce an effective open source resource for genomics research.
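To make the storage and search ideas in the abstract concrete, the following is a minimal sketch and not Machado's actual code. It assumes a heavily simplified subset of Chado-style tables (`organism`, `cvterm`, `feature`), uses SQLite from the Python standard library in place of a full Chado database, and stands in for the Haystack and Elasticsearch auto-complete with a plain SQL prefix match. The helper names `build_demo_db` and `suggest`, and all feature identifiers, are invented for illustration.

```python
import sqlite3

# Assumption: a simplified subset of Chado-style tables. The real Chado
# schema has many more tables, columns, and constraints.
SCHEMA = """
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    genus       TEXT NOT NULL,
    species     TEXT NOT NULL
);
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE feature (
    feature_id  INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    type_id     INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
    name        TEXT,
    uniquename  TEXT NOT NULL,
    UNIQUE (organism_id, uniquename, type_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding a few made-up features."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO organism VALUES (1, 'Arabidopsis', 'thaliana')")
    conn.executemany(
        "INSERT INTO cvterm VALUES (?, ?)", [(1, "gene"), (2, "mRNA")]
    )
    conn.executemany(
        "INSERT INTO feature (organism_id, type_id, name, uniquename) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, 1, "demoA", "GENE0001"),      # hypothetical identifiers
            (1, 2, "demoA.1", "GENE0001.1"),
            (1, 1, "demoB", "GENE0002"),
        ],
    )
    conn.commit()
    return conn


def suggest(conn, prefix, feature_type=None, limit=10):
    """Return (uniquename, type) pairs whose uniquename starts with prefix.

    The match is case-insensitive; feature_type optionally filters by the
    feature's type term, playing the role of a search filter.
    """
    sql = """
        SELECT f.uniquename, c.name
        FROM feature AS f
        JOIN cvterm AS c ON c.cvterm_id = f.type_id
        WHERE lower(substr(f.uniquename, 1, length(:prefix))) = lower(:prefix)
          AND (:ftype IS NULL OR c.name = :ftype)
        ORDER BY f.uniquename
        LIMIT :limit
    """
    params = {"prefix": prefix, "ftype": feature_type, "limit": limit}
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(suggest(db, "gene0001"))
    print(suggest(db, "GENE", feature_type="gene"))
```

A dedicated search engine would typically add ranking, tokenization, and faceting that this prefix query does not attempt; the sketch only shows how typed features in a Chado-like layout can back a single search box with filters.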