Department of Computer Engineering, U & P U. Patel, CSPIT, CHARUSAT, Changa, Gujarat, India.
Department of Artificial Intelligence and Machine Learning, CSPIT, CHARUSAT, Changa, Gujarat, India.
Sci Rep. 2024 Sep 27;14(1):22329. doi: 10.1038/s41598-024-73643-x.
Artificial Intelligence has evolved and is now closely associated with Deep Learning, driven by the availability of vast amounts of data and computing power. Traditionally, researchers have adopted a Model-Centric Approach, which focuses on developing new algorithms and models to enhance performance without altering the underlying data. However, Andrew Ng, a prominent figure in the AI community, has recently emphasized better-quality data over better models, which has given rise to the Data-Centric Approach, also known as the data-oriented technique. The transition from the model-oriented to the data-oriented approach has rapidly gained momentum in deep learning. Despite its promise, the Data-Centric Approach faces several challenges, including (a) generating high-quality data, (b) ensuring data privacy, and (c) addressing biases to achieve fairness in datasets. To date, there has been limited effort devoted to preparing quality data. Our work aims to address this gap by focusing on the generation of high-quality data through three methods: data augmentation, multi-stage hashing to eliminate duplicate instances, and confident learning to detect and correct noisy labels. Experiments were performed on three popular datasets, namely MNIST, Fashion MNIST, and CIFAR-10, using ResNet-18 as the common framework for both the Model-Centric and Data-Centric Approaches. Comparative performance analysis revealed that the Data-Centric Approach consistently outperformed the Model-Centric Approach by a relative margin of at least 3%. This finding highlights the potential for further exploration and adoption of the Data-Centric Approach in domains such as healthcare, finance, education, and entertainment, where data quality could significantly enhance performance.
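The abstract does not describe how the multi-stage hashing or the confident learning step is implemented. The following is only a minimal sketch of the two ideas, not the authors' method. It assumes that "multi-stage" means a cheap first-stage key followed by a full SHA-256 digest, and it uses a simplified form of confident learning in which each class threshold is the mean predicted probability of that class over the examples carrying that label. The function names `deduplicate` and `find_label_issues`, and the choice of first-stage key, are hypothetical.

```python
import hashlib
from collections import defaultdict


def deduplicate(images):
    """Return sorted indices of the first occurrence of each unique item.

    images: list of bytes objects (e.g. raw pixel buffers).
    Assumption: two stages, a cheap key followed by a SHA-256 digest.
    """
    # Stage 1: group candidates by a cheap key (length + first 16 bytes).
    buckets = defaultdict(list)
    for idx, raw in enumerate(images):
        buckets[(len(raw), raw[:16])].append(idx)

    keep = []
    for candidates in buckets.values():
        if len(candidates) == 1:
            # A lone item in its bucket cannot have an exact duplicate.
            keep.extend(candidates)
            continue
        # Stage 2: compute the full digest only inside crowded buckets.
        seen = set()
        for idx in candidates:
            digest = hashlib.sha256(images[idx]).digest()
            if digest not in seen:
                seen.add(digest)
                keep.append(idx)
    return sorted(keep)


def find_label_issues(pred_probs, labels):
    """Flag examples whose given label is likely wrong.

    pred_probs: list of per-class probability lists, ideally out-of-sample.
    labels: list of given integer labels.
    Returns a list of (example_index, suggested_label) pairs.
    Simplified confident learning; the thresholding rule is an assumption.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean probability of class k among examples labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else float("inf"))

    issues = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes the model predicts at or above that class's own threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            issues.append((i, guess))
    return issues
```

In a pipeline of this kind, `deduplicate` would typically run before training, and `find_label_issues` would take predicted probabilities from a cross-validated model (ResNet-18 in the setting above), so that flagged examples can be relabeled or dropped. This sketch catches only byte-identical duplicates; near-duplicate detection would need a perceptual hash instead.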