FoodBase corpus: a new resource of annotated food entities.

Affiliations

Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, ul. Rudzer Boshkovikj 16, 1000 Skopje, Macedonia.

Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia.

Publication information

Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz121.

DOI:10.1093/database/baz121

PMID:31682732

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC6827550/

Abstract

Annotated text corpora are essential for developing public health services and tools based on natural language processing (NLP) and text mining. Recent biomedical NLP shared tasks have provided annotated corpora for biomedical entities such as genes, phenotypes, drugs, diseases and chemical entities. Such corpora are needed to develop named-entity recognition (NER) models, which extract entities from text and help find relations between them. However, to the best of our knowledge, few annotated corpora provide information about food entities, even though food and dietary management is an essential public health issue. Hence, we developed a new annotated corpus of food entities, named FoodBase. It was constructed from recipes extracted from Allrecipes, which is currently the largest food-focused social network. The recipes were selected from five categories: 'Appetizers and Snacks', 'Breakfast and Lunch', 'Dessert', 'Dinner' and 'Drinks'. The semantic tags used to annotate food entities were selected from the Hansard corpus. To extract and annotate food entities, we applied a rule-based food NER method called FoodIE. Because FoodIE yields only a weakly annotated corpus, we manually evaluated its results on 1000 recipes to create a gold standard version of FoodBase. It consists of 12 844 food entity annotations describing 2105 unique food entities. We also provide a weakly annotated corpus covering an additional 21 790 recipes, which consists of 274 053 food entity annotations, 13 079 of which are unique. The FoodBase corpus is needed for developing corpus-based NER models for food science and can serve as a new benchmark dataset for machine learning tasks such as multi-class classification, multi-label classification and hierarchical multi-label classification. FoodBase can also be used to detect semantic differences and similarities between food concepts, and we believe it will open a new path for learning food embedding spaces that can be used in predictive studies.
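The abstract distinguishes between the total number of food entity annotations and the number of unique food entities. The following is a minimal sketch of how such statistics could be computed over an annotated corpus. The `Annotation` schema, the tag labels and the case-insensitive notion of uniqueness are all assumptions made for illustration; they are not the FoodBase file format, real Hansard tag codes or the authors' counting procedure.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One food entity mention (hypothetical schema, not the FoodBase format)."""
    recipe_id: str
    text: str          # surface form as it appears in the recipe
    semantic_tag: str  # placeholder label, not a real Hansard tag code


def corpus_stats(annotations):
    """Return (total annotations, unique entities, mean annotations per recipe)."""
    total = len(annotations)
    # Assumption: surface forms that differ only in case are the same entity.
    unique = len({a.text.lower() for a in annotations})
    recipes = {a.recipe_id for a in annotations}
    mean_per_recipe = total / len(recipes) if recipes else 0.0
    return total, unique, mean_per_recipe


def tag_distribution(annotations):
    """Count how many annotations carry each semantic tag."""
    return Counter(a.semantic_tag for a in annotations)


# Toy data invented for this example.
toy = [
    Annotation("r1", "olive oil", "FAT_OIL"),
    Annotation("r1", "garlic", "VEGETABLE"),
    Annotation("r2", "Garlic", "VEGETABLE"),
    Annotation("r2", "lemon juice", "FRUIT"),
]

total, unique, mean_per_recipe = corpus_stats(toy)
print(total, unique, mean_per_recipe)
```

Applying the same ratio to the figures reported in the abstract gives roughly 12.8 annotations per recipe for the gold standard (12 844 over 1000 recipes) and roughly 12.6 for the weakly annotated portion (274 053 over 21 790 recipes).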
