Tsai Richard Tzong-Han, Hsiao Yu-Cheng, Lai Po-Ting
Database (Oxford). 2016 Oct 25;2016:baw135. doi: 10.1093/database/baw135.
Chemical patents contain detailed information on novel chemical compounds that is valuable to the chemical and pharmaceutical industries. In this paper, we introduce NERChem, a system that recognizes chemical named entity mentions in chemical patents. NERChem is based on the conditional random fields (CRF) model. Our approach incorporates (1) class composition, which combines chemical classes whose naming conventions are similar; (2) BioNE features, which distinguish chemical mentions from other biomedical named entity (NE) mentions in the patents; and (3) full-token word features, which resolve the tokenization granularity problem. We evaluated our approach on the BioCreative V CHEMDNER-patent corpus and achieved an F-score of 87.17% in the Chemical Entity Mention in Patents (CEMP) task and a sensitivity of 98.58% in the Chemical Passage Detection (CPD) task, ranking alongside the top systems. Database URL: Our NERChem web-based system is publicly available at iisrserv.csie.ncu.edu.tw/nerchem.
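The abstract does not spell out how the full-token word features are computed. The following is a minimal sketch of one plausible reading, assuming that text is first split into fine-grained sub-tokens (letter runs, digit runs, single punctuation marks) and that each sub-token then carries the whitespace-delimited token it came from as an extra feature. The function names, the tokenization regex, and the feature keys are all hypothetical illustrations, not NERChem's actual implementation.

```python
import re

# Assumption: fine-grained tokenization splits on letter runs, digit runs,
# and single punctuation characters. The real NERChem tokenizer may differ.
SUBTOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^A-Za-z\d\s]")


def fine_tokenize(text):
    """Return (sub_token, full_token) pairs for every sub-token in the text."""
    pairs = []
    for full_token in text.split():
        for sub_token in SUBTOKEN_PATTERN.findall(full_token):
            pairs.append((sub_token, full_token))
    return pairs


def token_features(pairs, i):
    """Build a feature dictionary for the i-th sub-token (hypothetical keys)."""
    sub_token, full_token = pairs[i]
    return {
        # Features of the fine-grained sub-token itself.
        "sub.lower": sub_token.lower(),
        "sub.is_digit": sub_token.isdigit(),
        # Full-token features: the same value is shared by every sub-token
        # of one whitespace-delimited token, so the sequence model can see
        # the coarse context even though it labels fine-grained units.
        "full_token": full_token,
        "full.has_digit": any(ch.isdigit() for ch in full_token),
        "full.has_hyphen": "-" in full_token,
    }


tokens = fine_tokenize("treated with 2-chloroethanol overnight")
features = [token_features(tokens, i) for i in range(len(tokens))]

for (sub_token, _), feats in zip(tokens, features):
    print(sub_token, feats["full_token"])
```

In this sketch the sub-tokens `2`, `-`, and `chloroethanol` all share the full-token value `2-chloroethanol`, which is the kind of signal that could help a CRF label the pieces of one chemical name consistently. Per-token feature dictionaries of this shape are a common input format for CRF toolkits.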