Ozyurt I Burak, Brown Gregory G
Department of Psychiatry, University of California - San Diego, La Jolla, CA, USA.
Methods Mol Biol. 2009;569:173-96. doi: 10.1007/978-1-59745-524-4_9.
The ever-increasing size of the biomedical literature makes more precise information retrieval, and the ability to tap into implicit knowledge in scientific literature, a necessity. In this chapter, three new variants of the expectation-maximization (EM) method for semisupervised document classification (Machine Learning 39:103-134, 2000) are first introduced to refine biomedical literature meta-searches. The retrieval performance of a multi-mixture per class EM variant with Agglomerative Information Bottleneck clustering (Slonim and Tishby (1999) Agglomerative information bottleneck. In Proceedings of NIPS-12), using the Davies-Bouldin cluster validity index (IEEE Transactions on Pattern Analysis and Machine Intelligence 1:224-227, 1979), rivaled that of state-of-the-art transductive support vector machines (TSVM) (Joachims (1999) Transductive inference for text classification using support vector machines. In Proceedings of the International Conference on Machine Learning (ICML)). Moreover, the multi-mixture per class EM variant refined search results more quickly, improving execution time by more than an order of magnitude compared with TSVM. A second tool, CRFNER, uses conditional random fields (Lafferty et al. (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-2001) to recognize 15 types of named entities in schizophrenia abstracts. It outperformed ABNER (Settles (2004) Biomedical named entity recognition using conditional random fields and rich feature sets. In Proceedings of COLING 2004 International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA)) in biological named entity recognition and reached an F1 score of 82.5% on the second set of named entities.
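As an illustration of the baseline that the EM variants build on, the following is a minimal sketch of semisupervised EM with a multinomial naive Bayes classifier, in the spirit of the cited Machine Learning 39:103-134 method. It is not the chapter's multi-mixture per class variant: it assumes a single mixture component per class, uses no clustering, and all function names, the smoothing parameter `alpha`, and the toy word-count data are hypothetical choices made for this example.

```python
import numpy as np

def train_nb(X, R, alpha=1.0):
    """M-step: fit multinomial naive Bayes from soft class memberships.

    X: (n_docs, n_words) word-count matrix.
    R: (n_docs, n_classes) class membership weights (rows sum to 1).
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    class_mass = R.sum(axis=0)
    log_prior = np.log((class_mass + alpha) /
                       (class_mass.sum() + alpha * R.shape[1]))
    word_counts = R.T @ X + alpha  # (n_classes, n_words)
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def posteriors(X, log_prior, log_word):
    """E-step: class posteriors P(class | document) under the current model."""
    log_joint = X @ log_word.T + log_prior
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)  # stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def semisupervised_em(X_lab, y_lab, X_unlab, n_classes, n_iter=20):
    """Alternate E- and M-steps; labeled documents keep their fixed labels."""
    R_lab = np.eye(n_classes)[np.asarray(y_lab)]
    log_prior, log_word = train_nb(X_lab, R_lab)  # initialize from labeled data
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(n_iter):
        R_unlab = posteriors(X_unlab, log_prior, log_word)
        log_prior, log_word = train_nb(X_all, np.vstack([R_lab, R_unlab]))
    return log_prior, log_word

# Toy data (hypothetical): a 4-word vocabulary in which class 0 documents
# mostly use words 0-1 and class 1 documents mostly use words 2-3.
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 3]], dtype=float)
y_lab = np.array([0, 1])
X_unlab = np.array([[2, 2, 0, 0], [4, 0, 1, 0],
                    [0, 1, 3, 2], [0, 0, 1, 4]], dtype=float)

log_prior, log_word = semisupervised_em(X_lab, y_lab, X_unlab, n_classes=2)
X_new = np.array([[1, 2, 0, 0], [0, 0, 2, 1]], dtype=float)
predicted = posteriors(X_new, log_prior, log_word).argmax(axis=1)
```

In a meta-search setting, the labeled rows would correspond to the few search results a user has marked as relevant or irrelevant, and the unlabeled rows to the remaining results, which are then re-ranked by the posterior of the relevant class.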