Wilbur W J, Yang Y
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Comput Biol Med. 1996 May;26(3):209-22. doi: 10.1016/0010-4825(95)00055-0.
The biological literature presents a difficult challenge to information processing in its complexity, its diversity, and its sheer volume. Much of the diversity resides in its technical terminology, which has itself become voluminous. To deal more effectively with this large vocabulary and improve information processing, a focusing method has been developed that classifies terms by a measure of their importance in describing the content of the documents in which they occur. This measure, called the strength of a term, reflects how strongly the term's occurrences correlate with the subjects of documents in the database. If term occurrences are random, there is no correlation and the strength is zero; if, for every subject, the term is either always present or never present, its strength is one. We give here a new, information-theoretic interpretation of term strength, review some of its uses in focusing the processing of documents for information retrieval, and describe new results obtained in document categorization.
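The two boundary cases described above (zero for random occurrences, one when the subject fully determines whether the term is present) can be illustrated with a small sketch. The abstract does not give the formula for term strength, so the code below is not the authors' definition. It assumes a stand-in measure, the mutual information between term presence and subject divided by the entropy of term presence, chosen only because it shows the same two limits. The function names, the toy documents, and the explicit subject labels are all hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def normalized_association(docs, term):
    """Illustrative stand-in for term strength (an assumption, not the paper's formula).

    docs: list of (subject_label, set_of_terms) pairs.
    Returns I(T; S) / H(T), where T is "term present?" and S is the subject.
    """
    # Joint counts over (subject, term present?) pairs.
    joint = Counter((subject, term in terms) for subject, terms in docs)
    presence = Counter()
    subjects = Counter()
    for (subject, present), n in joint.items():
        presence[present] += n
        subjects[subject] += n

    h_term = entropy(presence.values())
    if h_term == 0.0:
        # Term occurs in every document or in none: treated as uninformative.
        # This is a convention of the sketch only.
        return 0.0
    mutual_info = h_term + entropy(subjects.values()) - entropy(joint.values())
    return mutual_info / h_term


# Hypothetical toy database with two subjects, A and B.
toy_docs = [
    ("A", {"kinase", "cell"}),
    ("A", {"kinase"}),
    ("B", {"cell"}),
    ("B", set()),
]

subject_bound = normalized_association(toy_docs, "kinase")  # always in A, never in B
subject_blind = normalized_association(toy_docs, "cell")    # half of A, half of B
```

In this toy data, "kinase" is fully determined by the subject and scores at the upper limit, while "cell" is distributed independently of the subject and scores at the lower limit. Real collections rarely carry explicit subject labels, so how subjects are inferred is a separate question that the sketch does not address.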