Hu Guanlan, Ahmed Mavra, L'Abbé Mary R
Department of Nutritional Sciences, Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada.
Department of Nutritional Sciences, Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada; Joannah & Brian Lawson Centre for Child Nutrition, University of Toronto, ON, Canada.
Am J Clin Nutr. 2023 Mar;117(3):553-563. doi: 10.1016/j.ajcnut.2022.11.022. Epub 2022 Dec 23.
Food categorization and nutrient profiling are labor-intensive, time-consuming, and costly tasks, given the number of products and labels in large food composition databases and the dynamic food supply.
This study used a pretrained language model and supervised machine learning to automate food category classification and nutrition quality score prediction based on manually coded and validated data, and compared the results with those of models that used bag-of-words features or structured nutrition facts as inputs.
Food product information from the University of Toronto Food Label Information and Price Database 2017 (n = 17,448) and the University of Toronto Food Label Information and Price Database 2020 (n = 74,445) was used. Health Canada's Table of Reference Amounts (TRA) (24 categories and 172 subcategories) was used for food categorization, and the Food Standards of Australia and New Zealand (FSANZ) nutrient profiling system was used for nutrition quality score evaluation. TRA categories and FSANZ scores were manually coded and validated by trained nutrition researchers. A modified pretrained sentence-Bidirectional Encoder Representations from Transformers model was used to encode unstructured text from food labels into lower-dimensional vector representations, followed by supervised machine learning algorithms (i.e., elastic net, k-Nearest Neighbors, and XGBoost) for multiclass classification and regression tasks.
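The second stage of the pipeline described above (a supervised classifier operating on label-text embeddings) can be illustrated with a minimal k-Nearest Neighbors sketch. This is not the authors' implementation: the three-dimensional vectors are hand-made stand-ins for sentence-embedding output, the product names and the category labels `bakery` and `dairy` are hypothetical, and the use of cosine similarity with k = 3 is an assumption made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy embeddings (hypothetical): a real pipeline would obtain these by
# encoding food label text with a pretrained sentence-embedding model.
train_vectors = [
    [0.9, 0.1, 0.0],  # e.g. "whole wheat bread"
    [0.8, 0.2, 0.1],  # e.g. "multigrain bagel"
    [0.1, 0.9, 0.1],  # e.g. "2% milk"
    [0.0, 0.8, 0.2],  # e.g. "plain yogurt"
]
train_labels = ["bakery", "bakery", "dairy", "dairy"]  # illustrative categories

query = [0.85, 0.15, 0.05]  # embedding of an unseen product (hypothetical)
predicted = knn_predict(train_vectors, train_labels, query, k=3)
print(predicted)
```

The same embedded vectors could be passed to any of the other learners named in the text; k-Nearest Neighbors is shown here only because it fits in a few lines without external dependencies.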
Pretrained language model representations used by the XGBoost multiclass classification algorithm reached overall accuracy scores of 0.98 and 0.96 in predicting food TRA major categories and subcategories, respectively, outperforming bag-of-words methods. For FSANZ score prediction, our proposed method reached a similar prediction accuracy (R: 0.87 and MSE: 14.4) compared with bag-of-words methods (R: 0.72-0.84; MSE: 30.3-17.6), whereas the machine learning model using structured nutrition facts performed best (R: 0.98; MSE: 2.5). The pretrained language model generalized better to the external test datasets than bag-of-words methods.
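For readers unfamiliar with the reported metrics, the sketch below shows how overall accuracy, mean squared error (MSE), and a coefficient of determination are computed. It assumes that the reported R is a coefficient-of-determination-type measure, which the text does not state explicitly. All values in the example are made-up toy data, not results from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy classification labels (hypothetical category names).
labels_true = ["bakery", "dairy", "bakery", "snacks"]
labels_pred = ["bakery", "dairy", "snacks", "snacks"]

# Toy nutrient profiling scores (hypothetical values).
scores_true = [2, 10, -3, 15]
scores_pred = [3, 9, -1, 14]

acc = accuracy(labels_true, labels_pred)
mse = mean_squared_error(scores_true, scores_pred)
r2 = r_squared(scores_true, scores_pred)
print(acc, mse, round(r2, 3))
```

A lower MSE and a higher coefficient of determination both indicate closer agreement between predicted and manually coded scores, which is the sense in which the structured nutrition facts model is described as performing best.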
Our automated approach achieved high accuracy in classifying food categories and predicting nutrition quality scores using text information found on food labels. This approach is effective and generalizable in a dynamic food environment, where large amounts of food label data can be obtained from websites.