St Matthews University School of Medicine, Grand Cayman, Cayman Islands; The University of Wisconsin-Stout, Menomonie, USA.
J Comput Aided Mol Des. 2011 May;25(5):427-41. doi: 10.1007/s10822-011-9429-x. Epub 2011 May 3.
A patent database of 6.7 million compounds, generated by a very high performance computer (Blue Gene), requires new techniques for exploitation when chemical similarity is used extensively. Such exploitation includes the taxonomic classification of chemical themes, and data mining to assess mutual information between themes and companies. Importantly, we also launch candidates that evolve by "natural selection", selected for their failure to partially match the patent database and for their ability to bind appropriately to the protein target in simulation on Blue Gene. An unusual feature of our method is that algorithms and workflows rely on dynamic interaction between match-and-edit instructions, which in practice are regular expressions. Similarity testing with these instructions uses SMILES strings and, less frequently, graph or connectivity representations. Examining how this performs in high throughput, we note that chemical similarity and novelty are human concepts whose meaning derives largely from their utility in specific contexts. For some purposes, mutual information involving chemical themes might be a better concept.
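The two ideas named in the abstract, match-and-edit instructions expressed as regular expressions over SMILES strings, and mutual information between themes and companies, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the theme patterns, the function names (`themes_in`, `edit`, `mutual_information`) and the example molecules are all assumptions made for illustration, and plain regex matching on SMILES is far cruder than real substructure search.

```python
import math
import re
from collections import Counter

# Hypothetical "themes": name -> regular expression over a SMILES string.
# Illustrative patterns only; they miss many equivalent SMILES spellings.
THEMES = {
    # C(=O)O not followed by another atom symbol (so esters are skipped)
    "carboxylic_acid": re.compile(r"C\(=O\)O(?![A-Za-z])"),
    "benzene_ring": re.compile(r"c1ccccc1"),
}


def themes_in(smiles):
    """Return the set of theme names whose pattern matches the SMILES string."""
    return {name for name, pat in THEMES.items() if pat.search(smiles)}


def edit(smiles, pattern, replacement):
    """Apply one match-and-edit instruction: a single regex substitution."""
    return re.sub(pattern, replacement, smiles, count=1)


def mutual_information(pairs):
    """Mutual information in bits between two discrete variables,
    estimated from a list of (x, y) observations, e.g. (theme, company)."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )
```

For example, `themes_in("CC(=O)Oc1ccccc1C(=O)O")` (aspirin) reports both themes, and `edit("CC(=O)O", r"C\(=O\)O$", "C(=O)N")` rewrites a terminal acid group into an amide. Passing a list of (theme, company) pairs to `mutual_information` gives 0 bits when themes are spread evenly across companies and larger values when a theme is concentrated in one company.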