Ruch Patrick
University Hospitals of Geneva, Medical Informatics Service CH-1201, Geneva.
Bioinformatics. 2006 Mar 15;22(6):658-64. doi: 10.1093/bioinformatics/bti783. Epub 2005 Nov 15.
We report on the development of a generic text categorization system designed to automatically assign biomedical categories to any input text. Unlike usual automatic text categorization systems, which rely on data-intensive models extracted from large sets of training data, our categorizer is largely data-independent.
To evaluate the robustness of our approach, we test the system on two different biomedical terminologies: the Medical Subject Headings (MeSH) and the Gene Ontology (GO). Our lightweight categorizer combines two ranking modules, a pattern matcher and a vector space retrieval engine, and uses both stems and linguistically motivated indexing units.
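As a rough illustration of how two ranking modules of this kind might be combined, the following Python sketch scores each vocabulary term with an exact phrase matcher and a TF-IDF cosine similarity, then merges the two scores linearly. It is not the system described in the abstract: the function names, the linear combination, the `alpha` weight, the plain lowercase tokenization (no stemming or phrase indexing), and the toy vocabulary are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a real system would likely stem these."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pattern_scores(text, terms):
    """Score 1.0 when a term occurs as a contiguous token sequence, else 0.0."""
    padded = " " + " ".join(tokenize(text)) + " "
    return {
        t: 1.0 if " " + " ".join(tokenize(t)) + " " in padded else 0.0
        for t in terms
    }


def vector_space_scores(text, terms):
    """TF-IDF cosine similarity between the input text and each term."""
    docs = {t: Counter(tokenize(t)) for t in terms}
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    n = len(terms)
    # Smoothed IDF computed over the vocabulary terms only (an assumption).
    idf = {w: math.log((n + 1) / (df[w] + 1)) + 1.0 for w in df}

    def weigh(counts):
        return {w: c * idf.get(w, 0.0) for w, c in counts.items()}

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    query = weigh(Counter(tokenize(text)))
    return {t: cosine(query, weigh(c)) for t, c in docs.items()}


def categorize(text, terms, alpha=0.5, top_k=3):
    """Rank terms by a weighted sum of the two module scores (hypothetical)."""
    pm = pattern_scores(text, terms)
    vs = vector_space_scores(text, terms)
    combined = {t: alpha * pm[t] + (1 - alpha) * vs[t] for t in terms}
    return sorted(terms, key=lambda t: combined[t], reverse=True)[:top_k]


# Toy vocabulary standing in for a controlled terminology such as MeSH.
vocabulary = ["Breast Neoplasms", "Neoplasms", "Gene Expression", "Cell Cycle"]
ranked = categorize(
    "Gene expression profiling in breast neoplasms patients", vocabulary
)
```

Because the ranking needs only the terminology itself, a sketch like this requires no labeled training documents, which is the sense in which such a categorizer can be called largely data-independent.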
Results show the effectiveness of phrase indexing for both GO and MeSH categorization, but we observe that the categorization power of the tool depends on the controlled vocabulary: precision at high ranks ranges from above 90% for MeSH to below 20% for GO, establishing a new baseline for categorizers based on retrieval methods.
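Precision at high ranks can be read as the fraction of correct categories among the top-ranked suggestions. The sketch below computes precision at a cutoff k under that reading; the exact evaluation protocol is not given in the abstract, so the function name, the division by k even when fewer than k categories are returned, and the sample data are assumptions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked categories that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for category in ranked[:k] if category in relevant)
    return hits / k


# Hypothetical ranked output and gold-standard annotations.
ranked_categories = ["A", "B", "C", "D"]
gold_categories = {"A", "C"}

p_at_1 = precision_at_k(ranked_categories, gold_categories, 1)
p_at_2 = precision_at_k(ranked_categories, gold_categories, 2)
```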