Suppr超能文献

使用整数熵编码对化学指纹进行无损压缩可改善存储和检索。

Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval.

作者信息

Baldi Pierre, Benz Ryan W, Hirschberg Daniel S, Swamidass S Joshua

机构信息

Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California-Irvine, Irvine, CA 92697-3435, USA.

出版信息

J Chem Inf Model. 2007 Nov-Dec;47(6):2098-109. doi: 10.1021/ci700200n. Epub 2007 Oct 30.

Abstract

Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components by decreasing frequency order, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new efficient, lossless, compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrival performance, as shown on six benchmark data sets of druglike molecules.

摘要

许多用于小分子的现代化学信息学系统依赖于大型指纹向量表示,其中向量的组件记录特定组合特征(如标记路径或标记树)在分子图中的存在情况或出现次数。这些大型指纹向量通常使用基于简单模运算过程的有损压缩方案压缩为更短的指纹向量。在此,我们将指纹的统计模型与整数熵编码(如哥伦布码和伊莱亚斯码)相结合,以对指纹的索引或游程长度进行编码。在按频率降序对指纹组件重新排序后,索引是单调递增的,游程长度是准单调递增的,并且两者都呈现幂律分布趋势。我们利用这些统计特性来推导针对单调整数序列的新的高效无损压缩算法:单调值(MOV)编码和单调长度(MOL)编码。与每个分子使用1024位或更多存储位的有损系统相比,我们可以基于循环子结构,使用MOL伊莱亚斯伽马码对游程长度进行编码,以略高于每个分子300位的方式实现长化学指纹的无损压缩,接近香农熵极限。存储方面的改进代价是适度的计算成本。此外,由于压缩是无损的,分子之间的未压缩相似度(如塔尼莫托相似度)可以从其压缩表示中精确计算出来,从而在检索性能上带来显著提升,如在六个类药物分子的基准数据集上所示。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a7ab/2536658/0b28421c24ac/nihms-62715-f0001.jpg

相似文献

3
Accelerating chemical database searching using graphics processing units.利用图形处理单元加速化学数据库搜索。
J Chem Inf Model. 2011 Aug 22;51(8):1807-16. doi: 10.1021/ci200164g. Epub 2011 Jul 13.

引用本文的文献

1
Molecular similarity: Theory, applications, and perspectives.分子相似性:理论、应用与展望。
Artif Intell Chem. 2024 Dec;2(2). doi: 10.1016/j.aichem.2024.100077. Epub 2024 Aug 31.
2
Computer-Assisted Retrosynthesis Based on Molecular Similarity.基于分子相似性的计算机辅助逆合成分析
ACS Cent Sci. 2017 Dec 27;3(12):1237-1245. doi: 10.1021/acscentsci.7b00355. Epub 2017 Nov 16.
4
Methods for Similarity-based Virtual Screening.基于相似性的虚拟筛选方法。
Comput Struct Biotechnol J. 2013 Mar 3;5:e201302009. doi: 10.5936/csbj.201302009. eCollection 2013.
10
Data structures and compression algorithms for genomic sequence data.用于基因组序列数据的数据结构和压缩算法。
Bioinformatics. 2009 Jul 15;25(14):1731-8. doi: 10.1093/bioinformatics/btp319. Epub 2009 May 15.

本文引用的文献

4
Cheminformatics analysis and learning in a data pipelining environment.数据管道环境中的化学信息学分析与学习
Mol Divers. 2006 Aug;10(3):283-99. doi: 10.1007/s11030-006-9041-5. Epub 2006 Sep 22.

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验