Heaps H S
J Chem Inf Comput Sci. 1975 Feb;15(1):32-9. doi: 10.1021/ci60001a011.
Consideration is given to a document data base that is structured for information retrieval purposes by means of an inverted index and term dictionary. Vocabulary characteristics of various fields are described, and it is shown how the data base may be stored in a compressed form by use of restricted variable length codes that produce a compression not greatly in excess of the optimum that could be achieved through use of Huffman codes. The coding is word oriented. An alternative scheme of word fragment coding is described. It has the advantage that it allows the use of a small dictionary, but is less efficient with respect to compression of the data base.
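The comparison between restricted variable length codes and Huffman codes can be illustrated with a minimal sketch. The abstract does not give the code lengths used in the paper, so everything here is an assumption made for illustration: the two permitted lengths (3 and 6 bits), the one-bit length flag, the function names, and the toy word list. The sketch only shows the general idea that a word-oriented code limited to a few fixed lengths costs somewhat more bits per word than the Huffman optimum.

```python
import heapq
from collections import Counter


def huffman_lengths(freqs):
    """Return {word: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {word: 1 for word in freqs}
    # Heap entries: (weight, tie_breaker, {word: depth so far}).
    heap = [(w, i, {word: 0}) for i, (word, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every word in them one level deeper.
        merged = {word: depth + 1 for word, depth in d1.items()}
        merged.update({word: depth + 1 for word, depth in d2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def restricted_lengths(freqs, short_bits=3, long_bits=6):
    """Assign each word one of two permitted code lengths (hypothetical scheme).

    The first bit of a code is a flag: 0 means a short code, 1 a long code.
    The remaining bits index the word, so the code is prefix-free.
    The most frequent words receive the short codes.
    """
    short_slots = 2 ** (short_bits - 1)
    long_slots = 2 ** (long_bits - 1)
    if len(freqs) > short_slots + long_slots:
        raise ValueError("vocabulary too large for the chosen code lengths")
    ranked = sorted(freqs, key=lambda word: (-freqs[word], word))
    return {
        word: short_bits if rank < short_slots else long_bits
        for rank, word in enumerate(ranked)
    }


def average_bits(lengths, freqs):
    """Average number of bits per word occurrence."""
    total = sum(freqs.values())
    return sum(freqs[word] * lengths[word] for word in freqs) / total


# Toy text standing in for one field of a document data base.
words = "the cat sat on the mat and the dog sat on the log".split()
freqs = Counter(words)

huff = huffman_lengths(freqs)
restr = restricted_lengths(freqs)

print("Huffman bits/word:   ", average_bits(huff, freqs))
print("Restricted bits/word:", average_bits(restr, freqs))
```

Because a Huffman code is optimal among prefix codes, the restricted scheme can never do better on average. The practical appeal of restricting the lengths is simpler encoding and decoding, at the price of a modest loss in compression.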