Advances in cheminformatics methodologies and infrastructure to support the data mining of large, heterogeneous chemical datasets.

Author information

Guha Rajarshi, Gilbert Kevin, Fox Geoffrey, Pierce Marlon, Wild David, Yuan Huapeng

Affiliation

School of Informatics, Indiana University, Bloomington, IN 47408, USA.

Publication information

Curr Comput Aided Drug Des. 2010 Mar;6(1):50-67. doi: 10.2174/157340910790980115.

DOI:10.2174/157340910790980115

PMID:20370695

Abstract

In recent years, there has been an explosion in the availability of publicly accessible chemical information, including chemical structures of small molecules, structure-derived properties and associated biological activities in a variety of assays. These data sources present a significant opportunity to develop and apply computational tools to extract and understand the underlying structure-activity relationships. Furthermore, by integrating chemical data sources with biological information (protein structure, gene expression and so on), we can attempt to build up a holistic view of the effects of small molecules in biological systems. Equally important is the ability of non-experts to access and use state-of-the-art cheminformatics methods and models. In this review, we present recent developments in cheminformatics methodologies and infrastructure that provide a robust, distributed approach to mining large and complex chemical datasets. In the area of methodology development, we highlight recent work on characterizing structure-activity landscapes, the domain applicability of Quantitative Structure-Activity Relationship (QSAR) models and the use of chemical similarity in text mining. In the area of infrastructure, we discuss a distributed web services framework that allows easy deployment of, and uniform access to, computational (statistics, cheminformatics and computational chemistry) methods, data and models. We also discuss the development of PubChem-derived databases and highlight techniques that allow us to scale the infrastructure to extremely large compound collections through distributed processing on Grids. Given that the above work is applicable to arbitrary types of cheminformatics problems, we also present some case studies related to virtual screening for anti-malarials and predictions of anti-cancer activity.
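The abstract does not say how structure-activity landscapes are characterized, so the following is only an illustrative sketch. It assumes compounds are represented as sets of fingerprint bits, compared with Tanimoto similarity, and scored pairwise by a landscape index of the form |A_i - A_j| / (1 - sim(i, j)), so that structurally similar pairs with very different activities rank highest. The function names, compound names and toy data are all hypothetical.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def landscape_index(act_i: float, act_j: float, sim: float) -> float:
    """Activity difference divided by structural dissimilarity (assumed form)."""
    if sim >= 1.0:
        # Identical fingerprints: the index is undefined, so treat any
        # activity difference as an infinitely steep change.
        return float("inf") if act_i != act_j else 0.0
    return abs(act_i - act_j) / (1.0 - sim)


def rank_pairs(compounds: dict) -> list:
    """Score every compound pair and return them steepest first.

    compounds maps a name to (fingerprint bit set, activity value).
    """
    scored = []
    for (n1, (fp1, a1)), (n2, (fp2, a2)) in combinations(compounds.items(), 2):
        sim = tanimoto(fp1, fp2)
        scored.append((landscape_index(a1, a2, sim), n1, n2))
    return sorted(scored, reverse=True)


# Hypothetical toy data: bit sets stand in for real fingerprints,
# and activities stand in for values such as pIC50.
toy = {
    "cmpd_A": (frozenset({1, 2, 3, 4}), 5.0),
    "cmpd_B": (frozenset({1, 2, 3, 5}), 8.0),
    "cmpd_C": (frozenset({7, 8, 9}), 5.5),
}

if __name__ == "__main__":
    for score, first, second in rank_pairs(toy):
        print(f"{first} / {second}: {score:.2f}")
```

In this toy set, cmpd_A and cmpd_B share most of their bits but differ widely in activity, so that pair ranks first. A real workflow would compute fingerprints with a cheminformatics toolkit instead of hand-written bit sets.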
