Lewinski Nastassja A, Jimenez Ivan, McInnes Bridget T
Department of Chemical and Life Science Engineering, Virginia Commonwealth University, Richmond, VA, USA.
Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA.
Int J Nanomedicine. 2017 Oct 12;12:7519-7527. doi: 10.2147/IJN.S137117. eCollection 2017.
A vast amount of data on nanomedicines is being generated and published, and natural language processing (NLP) approaches can automate the extraction of data from unstructured text. Annotated corpora are a key resource for NLP and information extraction methods that employ machine learning. Although corpora are available for pharmaceuticals, resources for nanomedicines and nanotechnology are still limited. To foster nanotechnology text mining (NanoNLP) efforts, we have constructed a corpus of annotated drug product inserts taken from the US Food and Drug Administration's (FDA) Drugs@FDA online database. In this work, we present the development of the Engineered Nanomedicine Database corpus to support the evaluation of nanomedicine entity extraction. The data were manually annotated for 21 entity mentions covering nanomedicine physicochemical characterization, exposure, and biologic response information for 41 FDA-approved nanomedicines. We evaluate the reliability of the manual annotations and demonstrate the use of the corpus by evaluating two state-of-the-art named entity extraction systems, OpenNLP and Stanford NER. The annotated corpus is available as open source and, based on these results, we provide guidelines and suggestions for the future development of additional nanomedicine corpora.