Schneider J H
Science. 1971 Jul 23;173(3994):300-8. doi: 10.1126/science.173.3994.300.
Selective dissemination of information to individuals provides a new and promising method for keeping abreast of current scientific information. Since SDI services are directed to the information needs of each individual, they are a significant step beyond group-oriented services and products, which require considerable expenditure of effort by each user as he sorts useful information from trash. However, SDI systems do require a high degree of precision in matching scientists against documents. They must operate more efficiently and economically than many current systems which occasionally provide a useful item of information to users. To meet these stringent requirements for quality, precision, efficiency, and economy, more research must be devoted to comparing and improving indexing methods, which are the basic component of all information storage and retrieval systems. It is incredible that so much money has been spent on the development and operation of scientific information systems before basic data on the comparative performance of various indexing methods have been gathered, analyzed, and confirmed by multiple investigators. The design of an effective information system would seem to require this type of basic knowledge, just as basic properties of alternative materials must be known before an engineer can design a building, bridge, or factory. Yet, except for the few studies mentioned in the previous section, research on indexing methods has been greatly neglected. Bourne's comment about studies of indexing languages is still an appropriate description of the situation: "In almost all the experimental reports, the investigator worked with an indexing language different than that of other experimenters. Consequently, no one has ever had his test results verified, or expanded, or made more precise by another experimenter" (47).
Most existing information systems are based on keyword indexing, with concepts broken into isolated terms during input operations and recombined to synthesize the original concept during search and retrieval. Such systems tend to involve imprecise indexing, with a high level of "noise" in retrieved documents, difficult search strategy involving extensive post-coordination, and lengthy, complex computer manipulations. This situation reflects the fact that many producers of indexed data originally focused the design of their systems on the production of a published product with entries printed under short, concise index headings. Production of magnetic tapes as a by-product of the publication process, and their use for retrospective searching or for SDI services, was a much later development, almost an afterthought. Yet use of these tapes is growing so rapidly that it may be time to redesign the tape-producing systems, with ease of tape use for SDI services and retrospective searching as the primary consideration, and with publication of abstract and index bulletins or title listings relegated to secondary importance (49). The use of keywords to index documents creates a high degree of disorganization in information search and retrieval operations: Information is scattered under the many different terms that can be used to index different aspects of a concept. If the large-scale, comprehensive abstracting and indexing services were based on enumerative classifications with assignment of documents to logical hierarchical categories at the time of initial indexing, then many of the specialized information centers (50) and the 1300 abstracting and indexing services (3) would be unnecessary, and much of the reindexing and reprocessing of documents, the repackaging and reworking of abstracts and index data, and the resulting overlap and duplication characteristic of current information processing could be terminated. 
Partly because of the disorganization resulting from keyword indexing, the cost of a 5-year retrospective search of information on just one data base on magnetic tapes is a major investment (16). The effort and cost required to find a few items of useful information scattered among 1,285,000 abstracts indexed on 116 full reels of magnetic tape (11 million characters per reel) which will be needed for the 5-year Eighth Collective Index to Chemical Abstracts (1967-1971) (51) staggers the imagination. In contrast, when HICLASS systems based on enumerative hierarchical classifications are used, concepts that might be useful for later retrieval are identified and related items of information are grouped together during the indexing process. These enumerative classifications, with single-hit matching, make it possible to index and retrieve ideas as intact units and to perform simple sequential searches of the very small segment of a file that deals with a given topic (31). The experiments at both the Science Information Exchange and the National Cancer Institute, as described in this article, demonstrate that automated HICLASS systems are feasible and can operate at a very satisfactory level of performance. Although considerable effort may be required for the development and constant updating of detailed enumerative classifications, HICLASS categories may facilitate organization of data at the time of input, improve the precision of matching documents with users, and greatly simplify search logic and computer manipulations. If so, then output savings and performance would more than justify input costs, and the development and use of enumerative classifications would be a better solution to information problems than the current keyword-and-coordination approach. 
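The contrast drawn above between keyword post-coordination and single-hit matching against an enumerative hierarchical classification can be illustrated with a small sketch. This is not the HICLASS implementation described in the article. The record layout, the sample documents, the dotted class codes, and the function names `keyword_search` and `class_search` are all hypothetical and chosen only for illustration. The sketch assumes that each document carries one class code assigned at indexing time and that the file is kept sorted by that code.

```python
from bisect import bisect_left

# Hypothetical toy records: (doc_id, keywords, class_code).
# The dotted class codes are an assumed notation, not the article's.
DOCS = [
    ("D1", {"liver", "tumor", "rat"}, "C.04.02"),
    ("D2", {"liver", "enzyme", "rat"}, "B.01.07"),
    ("D3", {"tumor", "virus", "mouse"}, "C.04.05"),
    ("D4", {"liver", "tumor", "virus"}, "C.04.02"),
]


def keyword_search(docs, terms):
    """Post-coordination: isolated terms are recombined at search time
    by intersecting the posting set of every query term."""
    index = {}
    for doc_id, keywords, _ in docs:
        for kw in keywords:
            index.setdefault(kw, set()).add(doc_id)
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


def class_search(docs, prefix):
    """Single-hit matching: with the file sorted by class code, jump to the
    first code at or after the prefix and scan only that contiguous segment."""
    ordered = sorted(docs, key=lambda d: d[2])
    codes = [d[2] for d in ordered]
    i = bisect_left(codes, prefix)
    hits = []
    while i < len(ordered) and codes[i].startswith(prefix):
        hits.append(ordered[i][0])
        i += 1
    return sorted(hits)


if __name__ == "__main__":
    print(keyword_search(DOCS, ["liver", "tumor"]))  # documents indexed under both terms
    print(class_search(DOCS, "C.04.02"))             # one narrow category
    print(class_search(DOCS, "C.04"))                # the whole parent category
```

In this toy file the keyword query must combine two posting sets, whereas the class query reads one short run of adjacent records. Whether that difference yields the savings suggested in the text for real files depends on the quality and upkeep of the classification, which the sketch does not model.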
It is time to think beyond the ease of the single input step in information systems and to take a hard look at ways of easing retrieval problems for the multitude of information systems that process the indexed data (52). Indexing effort is expended only once, whereas search and retrieval effort is required by every user of a system. If information were better analyzed and organized during input operations, if more basic research were devoted to the effect of indexing methods on the performance of information systems, and if more emphasis were placed on the quality and usefulness of retrieved information, then the magnitude of problems related to the storage and retrieval of scientific information might be considerably reduced.