Calhoun Bradley T, Browning Michael R, Chen Brian R, Bittker Joshua A, Swamidass S Joshua
Washington University School of Medicine, 660 S. Euclid, St Louis, MO 63108, USA.
J Biomol Screen. 2012 Sep;17(8):1071-9. doi: 10.1177/1087057112449054. Epub 2012 Jun 12.
Public databases that store the data from small-molecule screens are a rich and untapped resource of chemical and biological information. However, screening databases are unorganized, which makes their data difficult to interpret. We propose a method that infers workflow graphs, which encode the relationships between assays in screening projects, directly from screening data and then uses these workflows to organize each project's data. On the basis of four heuristics regarding the organization of screening projects, we designed an algorithm that extracts a project's workflow graph from its screening data. Where possible, we evaluate the algorithm by comparing each project's inferred workflow to its documentation. In the majority of cases, there are no discrepancies between the two. Most errors can be traced to points in the project where screeners chose additional molecules to test based on structural similarity to promising molecules, a case our algorithm cannot yet handle. Nonetheless, these workflows accurately organize most of the data and also provide a way to visualize a screening project. The method is robust enough to support a workflow-oriented front-end to PubChem and is used regularly by both our lab and our collaborators. A Python implementation of the algorithm is available online, and a searchable database of all PubChem workflows is available at http://swami.wustl.edu/flow.
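The abstract does not state the four heuristics, so the following is only a minimal sketch of the general idea of inferring assay-to-assay edges from screening data, not the authors' algorithm. It assumes a single hypothetical rule: an assay's parent is the smallest strictly larger assay that tested every one of its molecules. The function name `infer_workflow`, the assay names, and the molecule IDs are all illustrative.

```python
from typing import Dict, List, Set, Tuple


def infer_workflow(tested: Dict[str, Set[int]]) -> List[Tuple[str, str]]:
    """Return sorted (parent, child) edges between assays.

    Hypothetical rule (an assumption, not taken from the paper): the parent
    of an assay is the smallest strictly larger assay whose tested-molecule
    set contains all of the child's molecules.
    """
    edges = []
    for child, child_mols in tested.items():
        # Candidate parents tested a strict superset-sized set that covers the child.
        candidates = [
            (len(mols), name)
            for name, mols in tested.items()
            if name != child
            and len(mols) > len(child_mols)
            and child_mols <= mols
        ]
        if candidates:
            # Smallest covering assay wins; ties are broken by assay name.
            _, parent = min(candidates)
            edges.append((parent, child))
    return sorted(edges)


# Toy project: each follow-up assay retests a subset of the previous one.
project = {
    "primary": {1, 2, 3, 4, 5, 6, 7, 8},
    "confirmatory": {2, 3, 5, 7},
    "dose_response": {3, 5},
}
workflow = infer_workflow(project)

# Same project, but molecule 99 was added to the last assay from outside
# the earlier assays (for example, chosen by structural similarity).
project_with_analog = dict(project, dose_response={3, 5, 99})
workflow_with_analog = infer_workflow(project_with_analog)
```

In this toy rule, the added molecule breaks the subset relation, so `dose_response` loses its parent edge. That mirrors the kind of failure the abstract attributes to molecules chosen by structural similarity, although the real algorithm may behave differently in that case.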