Reoptimization of MDL keys for use in drug discovery.

Author information

Durant Joseph L, Leland Burton A, Henry Douglas R, Nourse James G

Affiliation

MDL Information Systems, 14600 Catalina Street, San Leandro, California 94577, USA.

Publication information

J Chem Inf Comput Sci. 2002 Nov-Dec;42(6):1273-80. doi: 10.1021/ci010132r.

Abstract

For a number of years MDL products have exposed both 166 bit and 960 bit keysets based on 2D descriptors. These keysets were originally constructed and optimized for substructure searching. We report on improvements in the performance of MDL keysets which are reoptimized for use in molecular similarity. Classification performance for a test data set of 957 compounds was increased from 0.65 for the 166 bit keyset and 0.67 for the 960 bit keyset to 0.71 for a surprisal S/N pruned keyset containing 208 bits and 0.71 for a genetic algorithm optimized keyset containing 548 bits. We present an overview of the underlying technology supporting the definition of descriptors and the encoding of these descriptors into keysets. This technology allows definition of descriptors as combinations of atom properties, bond properties, and atomic neighborhoods at various topological separations as well as supporting a number of custom descriptors. These descriptors can then be used to set one or more bits in a keyset. We constructed various keysets and optimized their performance in clustering bioactive substances. Performance was measured using methodology developed by Briem and Lessel. "Directed pruning" was carried out by eliminating bits from the keysets on the basis of random selection, values of the surprisal of the bit, or values of the surprisal S/N ratio of the bit. The random pruning experiment highlighted the insensitivity of keyset performance for keyset lengths of more than 1000 bits. Contrary to initial expectations, pruning on the basis of the surprisal values of the various bits resulted in keysets which underperformed those resulting from random pruning. In contrast, pruning on the basis of the surprisal S/N ratio was found to yield keysets which performed better than those resulting from random pruning. We also explored the use of genetic algorithms in the selection of optimal keysets. Once more the performance was only a weak function of keyset size, and the optimizations failed to identify a single globally optimal keyset. Instead multiple, equally optimal keysets could be produced which had relatively low overlap of the descriptors they encoded.
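
The directed-pruning idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Keysets are represented here as sets of bit indices, and the surprisal of a bit is assumed to be the standard information-theoretic quantity -log2(p) for a bit set in a fraction p of the compounds. The abstract gives neither the paper's exact definition of surprisal nor that of the surprisal S/N ratio, so the latter is left out. Keeping the highest-scoring bits is also an assumption, as is the use of Tanimoto similarity to compare keysets. All function names are hypothetical.

```python
import math
import random


def bit_frequencies(keysets, n_bits):
    """Return, for each bit position, the fraction of keysets that set it."""
    counts = [0] * n_bits
    for bits in keysets:
        for b in bits:
            counts[b] += 1
    return [c / len(keysets) for c in counts]


def surprisal(p):
    """Assumed definition: -log2(p) for a bit set with frequency p."""
    return math.inf if p == 0 else -math.log2(p)


def keep_top_bits(scores, n_keep):
    """Keep the n_keep highest-scoring bits (the direction is an assumption)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:n_keep])


def keep_random_bits(n_bits, n_keep, seed=0):
    """Random-pruning baseline: keep n_keep bits chosen uniformly at random."""
    return set(random.Random(seed).sample(range(n_bits), n_keep))


def project(bits, kept):
    """Restrict one keyset to the bits that survived pruning."""
    return bits & kept


def tanimoto(a, b):
    """Tanimoto similarity of two bit sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data: four compounds encoded over a 4-bit keyset.
keysets = [{0, 1}, {0, 2}, {0, 1, 3}, {0}]
freqs = bit_frequencies(keysets, 4)
scores = [surprisal(p) for p in freqs]
kept = keep_top_bits(scores, 2)
pruned = [project(k, kept) for k in keysets]
```

In a fuller experiment, each pruning strategy would be scored at several keyset lengths by recomputing classification performance on the pruned keysets. That scoring step (the Briem and Lessel methodology) is not reproduced here.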
