Ispirova Gordana, Cenikj Gjorgjina, Ogrinc Matevž, Valenčič Eva, Stojanov Riste, Korošec Peter, Cavalli Ermanno, Koroušić Seljak Barbara, Eftimov Tome
Computer Systems Department, Jožef Stefan Institute, 1000 Ljubljana, Slovenia.
Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia.
Foods. 2022 Sep 2;11(17):2684. doi: 10.3390/foods11172684.
Despite the numerous studies involving food and nutrition data over the last decade, this domain remains low-resourced. Annotated corpora are very useful tools for researchers and domain experts, as well as for data scientists performing analysis. In this paper, we present the process of annotating food consumption data (recipes) with semantic tags from different semantic resources: the Hansard taxonomy, the FoodOn ontology, the SNOMED CT terminology and the FoodEx2 classification system. FoodBase is an annotated corpus of food entities (recipes) that includes a curated version of 1000 instances, considered a gold standard. In this study, we use the curated version of FoodBase and two different annotation approaches: the NCBO annotator (for the FoodOn and SNOMED CT annotations) and the semi-automatic StandFood method (for the FoodEx2 annotations). The end result is a new version of the FoodBase gold standard, called the CafeteriaFCD (Cafeteria Food Consumption Data) corpus. This corpus contains food consumption data (recipes) annotated with semantic tags from the aforementioned four external semantic resources. With these annotations, data interoperability is achieved between five semantic resources from different domains. This resource can be further utilized for developing and training different information extraction pipelines using state-of-the-art NLP approaches for tracing knowledge about food safety applications.
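The interoperability claim above can be illustrated with a minimal sketch. It is not the CafeteriaFCD format or the authors' code: the `FoodEntity` class, the `cross_resource_links` helper, and every identifier in the sample data are hypothetical placeholders. The assumption is only that each food entity carries tags from several resources, so tags that co-occur on the same entity can be used to link one resource to another.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FoodEntity:
    """One food mention from a recipe, with tags keyed by semantic resource."""
    text: str
    tags: dict[str, set[str]] = field(default_factory=dict)


def cross_resource_links(entities, source, target):
    """Map each `source` tag to the `target` tags found on the same entities."""
    links = defaultdict(set)
    for entity in entities:
        for source_tag in entity.tags.get(source, ()):
            # Tags attached to the same entity are treated as linked.
            links[source_tag].update(entity.tags.get(target, ()))
    return dict(links)


# Placeholder identifiers only; these are NOT real codes from any resource.
entities = [
    FoodEntity("olive oil", {
        "Hansard": {"HANSARD_TAG_1"},
        "FoodOn": {"FOODON_ID_1"},
        "SNOMED CT": {"SNOMED_ID_1"},
        "FoodEx2": {"FOODEX2_CODE_1"},
    }),
    FoodEntity("butter", {
        "Hansard": {"HANSARD_TAG_1"},
        "FoodOn": {"FOODON_ID_2"},
        "FoodEx2": {"FOODEX2_CODE_2"},
    }),
]

links = cross_resource_links(entities, "Hansard", "FoodOn")
```

Here `links` relates the shared Hansard placeholder tag to both FoodOn placeholder identifiers. An entity that lacks a tag in the target resource yields an empty set, which makes gaps in coverage visible.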