Sakib Nazmus, Shahariar G M, Kabir Md Mohsinul, Hasan Md Kamrul, Mahmud Hasan
SSL Lab, Dept. of CSE, Islamic University of Technology, Dhaka, Bangladesh.
Dept. of CSE, Ahsanullah University of Science and Technology, Dhaka, Bangladesh.
PLoS One. 2025 Jan 28;20(1):e0317697. doi: 10.1371/journal.pone.0317697. eCollection 2025.
Sharing cooking recipes is a convenient way to exchange culinary ideas and provide instructions for food preparation. However, categorizing raw recipes found online into appropriate food genres is challenging because adequately labeled data are scarce. In this study, we present the "Assorted, Archetypal, and Annotated Two Million Extended (3A2M+) Cooking Recipe Dataset", which contains two million culinary recipes labeled with their respective categories, along with extended named entities extracted from the recipe descriptions. The dataset includes the features title, NER, directions, and extended NER, together with nine genre labels: bakery, drinks, non-veg, vegetables, fast food, cereals, meals, sides, and fusions. The proposed pipeline, also named 3A2M+, uses two NER extraction tools to extend the Named Entity Recognition (NER) list with entities missing from it, such as heat, time, or process, taken from the recipe directions. The 3A2M+ dataset supports a range of challenging recipe-related tasks, including classification, named entity recognition, and recipe generation. Furthermore, we applied traditional machine learning, deep learning, and pre-trained language models to classify the recipes into their corresponding genres and achieved an overall accuracy of 98.6%. Our investigation indicates that the title feature played a more significant role in classifying the genre.
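As a rough illustration of the record layout described above, the following minimal Python sketch models one recipe with the four named features and one of the nine genre labels. The field names, the list-of-strings types, and the `extend_ner` helper are assumptions made for illustration; they are not the dataset's actual column names or the authors' pipeline. The two NER extraction tools are not named in the abstract, so their output is simply passed in as a list.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# The nine genre labels listed in the abstract.
GENRES = (
    "bakery", "drinks", "non-veg", "vegetables", "fast food",
    "cereals", "meals", "sides", "fusions",
)


@dataclass
class Recipe:
    """Hypothetical in-memory view of one record (field names are assumed)."""
    title: str
    ner: list[str]            # original named-entity list
    directions: list[str]     # preparation steps
    genre: str                # one of GENRES
    extended_ner: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject labels outside the nine genres.
        if self.genre not in GENRES:
            raise ValueError(f"unknown genre: {self.genre!r}")


def extend_ner(recipe: Recipe, extra_entities: list[str]) -> Recipe:
    """Hypothetical helper: merge extra entities (e.g. heat, time, process)
    into the original NER list, skipping case-insensitive duplicates."""
    seen = {entity.lower() for entity in recipe.ner}
    merged = list(recipe.ner)
    for entity in extra_entities:
        key = entity.lower()
        if key not in seen:
            seen.add(key)
            merged.append(entity)
    recipe.extended_ner = merged
    return recipe
```

In this sketch, calling `extend_ner(recipe, ["350 degrees", "1 hour", "bake"])` on a recipe whose `ner` holds only ingredients would store the ingredients followed by the heat, time, and process terms in `extended_ner`, leaving the original `ner` list unchanged.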