Constructing a metadata knowledge graph as an atlas for demystifying AI pipeline optimization.

Authors

Venkataramanan Revathy, Tripathy Aalap, Kumar Tarun, Serebryakov Sergey, Justine Annmary, Shah Arpit, Bhattacharya Suparna, Foltin Martin, Faraboschi Paolo, Roy Kaushik, Sheth Amit

Affiliations

AI Institute, University of South Carolina, Columbia, SC, United States.

Hewlett Packard Enterprise Labs, Houston, TX, United States.

Publication information

Front Big Data. 2025 Jan 7;7:1476506. doi: 10.3389/fdata.2024.1476506. eCollection 2024.

Abstract

The emergence of advanced artificial intelligence (AI) models has driven the development of frameworks and approaches that focus on automating model training and hyperparameter tuning of end-to-end AI pipelines. However, other crucial stages of these pipelines, such as dataset selection, feature engineering, and model optimization for deployment, have received less attention. Improving the efficiency of end-to-end AI pipelines requires metadata from past executions of AI pipelines and all their stages. Regenerating this metadata history by re-executing existing AI pipelines is computationally challenging and impractical. To address this issue, we propose to source AI pipeline metadata from open-source platforms such as Papers-with-Code, OpenML, and Hugging Face. However, integrating and unifying the varying terminologies and data formats from these diverse sources is a challenge. In this study, we present a solution by introducing the Common Metadata Ontology (CMO), which is used to construct an extensive AI Pipeline Metadata Knowledge Graph (AIMKG) consisting of 1.6 million pipelines. Through semantic enhancements, the pipeline metadata in AIMKG is also enriched for downstream tasks such as search and recommendation of AI pipelines. We perform quantitative and qualitative evaluations of AIMKG on searching for and recommending pipelines relevant to a user query. For the quantitative evaluation, we propose a custom aggregation model that outperforms other baselines by achieving a retrieval accuracy (R@1) of 76.3%. Our qualitative analysis shows that the AIMKG-based recommender retrieved relevant pipelines in 78% of test cases, compared with the state-of-the-art MLSchema-based recommender, which retrieved relevant responses in 51% of the cases. AIMKG serves as an atlas for navigating the evolving AI landscape, providing practitioners with a comprehensive factsheet for their applications. It guides AI pipeline optimization, offers insights and recommendations for improving AI pipelines, and serves as a foundation for data mining and analysis of evolving AI workflows.
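
The abstract reports retrieval accuracy as R@1 without spelling out the computation. As a rough illustration only, the sketch below assumes the common hit-rate reading of R@k: the fraction of queries for which at least one relevant pipeline appears among the top k results. The function name `recall_at_k`, the query and pipeline identifiers, and the toy data are all hypothetical and are not taken from the paper, whose exact metric definition may differ.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 1,
) -> float:
    """Fraction of queries with at least one relevant item in the top-k results.

    Assumption: this is the hit-rate reading of R@k, not necessarily
    the definition used in the paper.
    """
    if not ranked:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for query, results in ranked.items()
        if any(item in relevant.get(query, set()) for item in results[:k])
    )
    return hits / len(ranked)


# Hypothetical toy data: ranked pipeline IDs returned for each query.
ranked = {
    "q1": ["p3", "p1", "p7"],
    "q2": ["p2", "p9", "p4"],
    "q3": ["p5", "p6", "p8"],
    "q4": ["p1", "p2", "p3"],
}
# Hypothetical ground truth: pipelines judged relevant for each query.
relevant = {
    "q1": {"p3"},
    "q2": {"p9"},
    "q3": {"p8"},
    "q4": {"p1", "p4"},
}

r_at_1 = recall_at_k(ranked, relevant, k=1)
r_at_3 = recall_at_k(ranked, relevant, k=3)
print(f"R@1 = {r_at_1:.2f}, R@3 = {r_at_3:.2f}")
```

Under this reading, an R@1 of 76.3% would mean that the top-ranked pipeline was relevant for roughly three out of four test queries.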

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b39/11748301/3ba7a13a2233/fdata-07-1476506-g0001.jpg
