FoodBase corpus: a new resource of annotated food entities.

Affiliations

Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, ul. Rudzer Boshkovikj 16, 1000 Skopje, Macedonia.

Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia.

Publication information

Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz121.

Abstract

The existence of annotated text corpora is essential for the development of public health services and tools based on natural language processing (NLP) and text mining. Recently organized biomedical NLP shared tasks have provided annotated corpora related to different biomedical entities such as genes, phenotypes, drugs, diseases and chemical entities. These corpora are needed to develop named-entity recognition (NER) models, which are used for extracting entities from text and finding their relations. However, to the best of our knowledge, few annotated corpora provide information about food entities, even though food and dietary management is an essential public health issue. Hence, we developed a new annotated corpus of food entities, named FoodBase. It was constructed using recipes extracted from Allrecipes, which is currently the largest food-focused social network. The recipes were selected from five categories: 'Appetizers and Snacks', 'Breakfast and Lunch', 'Dessert', 'Dinner' and 'Drinks'. Semantic tags used for annotating food entities were selected from the Hansard corpus. To extract and annotate food entities, we applied a rule-based food NER method called FoodIE. Since FoodIE provides a weakly annotated corpus, we created a gold standard of FoodBase by manually evaluating the obtained results on 1000 recipes. It consists of 12 844 food entity annotations describing 2105 unique food entities. Additionally, we provided a weakly annotated corpus on an additional 21 790 recipes. It consists of 274 053 food entity annotations, 13 079 of which are unique. The FoodBase corpus is necessary for developing corpus-based NER models for food science, and it can serve as a new benchmark dataset for machine learning tasks such as multi-class classification, multi-label classification and hierarchical multi-label classification. FoodBase can also be used for detecting semantic differences/similarities between food concepts, and ultimately we believe that it will open a new path for learning a food embedding space that can be used in predictive studies.

