Nomenclature-based data retrieval without prior annotation: facilitating biomedical data integration with fast doublet matching.

Cancer Diagnosis Program, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, USA.

In Silico Biol. 2005;5(3):313-22. Epub 2005 Apr 3.

Assigning nomenclature codes to biomedical data is an arduous, expensive and error-prone task. Data records are coded to provide a common representation of contained concepts, allowing facile retrieval of records via a standard terminology. In the medical field, cancer registrars, nurses, pathologists, and private clinicians all understand the importance of annotating medical records with vocabularies that codify the names of diseases, procedures, billing categories, etc. Molecular biologists need codified medical records so that they can discover or validate relationships between experimental data and clinical data. This paper introduces a new approach to retrieving data records without prior coding. The approach achieves the same result as a search over pre-coded records. It retrieves all records that contain any terms that are synonymous with a user's query term. A recently described fast algorithm (the doublet method) permits quick iterative searches over every synonym for any term from any nomenclature occurring in a dataset of any size. As a demonstration, a 105+ megabyte corpus of PubMed abstracts was searched for medical terms. Query terms were matched against either of two vocabularies and expanded as an array of equivalent search items. A single search term may have over one hundred nomenclature synonyms, all of which were searched against the full database. Iterative searches of a list of concept-equivalent terms involve many more operations than a single search over pre-annotated concept codes. Nonetheless, the doublet method achieved fast query response times (0.05 seconds using SNOMED and 5 seconds using the Developmental Lineage Classification of Neoplasms, on a computer with a 2.89 GHz processor). Pre-annotated datasets lose their value when the chosen vocabulary is replaced by a different vocabulary or by a different version of the same vocabulary. The doublet method can employ any version of any vocabulary with no pre-annotation. In many instances, the enormous effort and expense associated with data annotation can be eliminated by on-the-fly doublet matching. The algorithm for nomenclature-based database searches using the doublet method is described. Perl scripts for implementing the algorithm and testing execution speed are provided as open source documents available from the Association for Pathology Informatics (www.pathologyinformatics.org/informatics_r.htm).
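The query-expansion step described in the abstract (match a query term against a vocabulary, expand it to every synonymous term, then search the un-annotated records for any of them) can be illustrated with a minimal Python sketch. This is not the authors' Perl implementation and it does not reproduce the doublet method itself: it uses plain substring matching. The vocabulary, the concept codes (`C001`, `C002`), the sample records, and all function names are hypothetical, chosen only for illustration.

```python
# Hypothetical toy vocabulary: concept code -> list of synonymous terms.
# A real nomenclature would supply far more concepts and synonyms.
VOCABULARY = {
    "C001": ["renal cell carcinoma", "hypernephroma", "kidney adenocarcinoma"],
    "C002": ["hepatocellular carcinoma", "hepatoma"],
}

# Hypothetical un-annotated records (no concept codes attached).
RECORDS = [
    "Biopsy showed a hypernephroma of the left kidney.",
    "Patient history includes hepatoma treated in 2001.",
    "Findings consistent with renal cell carcinoma.",
    "No evidence of malignancy.",
]


def build_term_index(vocabulary):
    """Map each lower-cased term to the concept code it belongs to."""
    index = {}
    for code, terms in vocabulary.items():
        for term in terms:
            index[term.lower()] = code
    return index


def expand_query(query, vocabulary, term_index):
    """Return all synonyms sharing the query's concept, or just the query
    itself when the term is not found in the vocabulary."""
    code = term_index.get(query.lower())
    if code is None:
        return [query.lower()]
    return [term.lower() for term in vocabulary[code]]


def search(records, query, vocabulary):
    """Return indices of records containing any synonym of the query term.
    Plain substring matching stands in for the faster doublet matching."""
    term_index = build_term_index(vocabulary)
    synonyms = expand_query(query, vocabulary, term_index)
    hits = []
    for i, text in enumerate(records):
        lowered = text.lower()
        if any(synonym in lowered for synonym in synonyms):
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(search(RECORDS, "renal cell carcinoma", VOCABULARY))  # [0, 2]
```

Because the vocabulary is consulted only at query time, swapping in a different vocabulary or a newer version changes the results without touching the stored records, which is the property the abstract emphasizes.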
