Akhondi Saber A, Klenner Alexander G, Tyrchan Christian, Manchala Anil K, Boppana Kiran, Lowe Daniel, Zimmermann Marc, Jagarlapudi Sarma A R P, Sayle Roger, Kors Jan A, Muresan Sorel
Department of Medical Informatics, Erasmus University Medical Centre, Rotterdam, The Netherlands.
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), Fraunhofer-Gesellschaft, Sankt Augustin, Germany.
PLoS One. 2014 Sep 30;9(9):e107477. doi: 10.1371/journal.pone.0107477. eCollection 2014.
Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Manual extraction of chemical and biological entities from patents by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.
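The abstract does not state how the harmonized annotations or the inter-annotator agreement scores were computed. As a minimal sketch only, assuming exact span-and-type matching, a pairwise F-score as the agreement measure, and a simple majority vote for harmonization (none of which is confirmed by the abstract), the idea could look like the following. The group names, offsets, and the helper functions `pairwise_f1` and `harmonize` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple

# Hypothetical representation: (start_offset, end_offset, entity_type).
Annotation = Tuple[int, int, str]


def pairwise_f1(a: Set[Annotation], b: Set[Annotation]) -> float:
    """F-score between two annotator groups, counting exact matches only.

    Assumption: an annotation matches only if span and type are identical.
    """
    if not a and not b:
        return 1.0  # two empty annotation sets agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


def harmonize(groups: Dict[str, Set[Annotation]], min_votes: int = 2) -> Set[Annotation]:
    """Keep annotations marked by at least `min_votes` groups.

    Assumption: majority voting stands in for the unspecified harmonization rule.
    """
    votes = Counter(ann for anns in groups.values() for ann in anns)
    return {ann for ann, count in votes.items() if count >= min_votes}


# Made-up annotations for a single patent from three illustrative groups.
groups = {
    "group_a": {(0, 7, "chemical"), (12, 20, "disease"), (25, 31, "target")},
    "group_b": {(0, 7, "chemical"), (12, 20, "disease")},
    "group_c": {(0, 7, "chemical"), (25, 31, "target"), (40, 44, "chemical")},
}

# Agreement score for every pair of groups.
agreement = {
    (g1, g2): pairwise_f1(groups[g1], groups[g2])
    for g1, g2 in combinations(sorted(groups), 2)
}
harmonized = harmonize(groups)

for pair, score in agreement.items():
    print(pair, round(score, 3))
print(sorted(harmonized))
```

Requiring agreement from at least two of three groups trades some recall for precision; a stricter or looser threshold, or partial-span matching, would give different harmonized sets and agreement scores.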