Jain Mansi, Kaur Harmeet, Gupta Bhavna, Gera Jaya, Kalra Vandana
Department of Computer Science, Shyama Prasad Mukherji College for Women, University of Delhi, Delhi, India.
Department of Computer Science, Hansraj College, University of Delhi, Delhi, India.
Sci Rep. 2025 Jan 2;15(1):272. doi: 10.1038/s41598-024-78785-6.
Domain-specific vocabulary, which is crucial in fields such as Information Retrieval and Natural Language Processing, requires continuous updates to remain effective. Incremental Learning, unlike conventional methods, updates existing knowledge without retraining from scratch. This paper presents an incremental learning algorithm for updating domain-specific vocabularies. It introduces DocLib, an archive used to capture a compact footprint of previously seen data and vocabulary terms. Task-based evaluation measures the effectiveness of the updated vocabulary by using vocabulary terms to perform a downstream task of text classification. The classification accuracy gauges the effectiveness of the vocabulary in discerning unseen documents related to the domain. Experiments illustrate that multiple incremental updates maintain vocabulary relevance without compromising its effectiveness. The proposed algorithm ensures bounded memory and processing requirements, distinguishing it from conventional approaches. Novel algorithms are introduced to assess the stability and plasticity of the proposed approach, demonstrating its ability to assimilate new knowledge while retaining old insights. The generalizability of the vocabulary is tested across datasets, achieving 97.89% accuracy in identifying domain-related data. A comparison with state-of-the-art techniques using a benchmark dataset confirms the effectiveness of the proposed approach. Importantly, this approach extends beyond classification tasks, potentially benefiting other research fields.
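The abstract does not give the details of the proposed algorithm or of DocLib, so the following is only a minimal sketch of the general idea it describes: folding new batches of documents into a compact archive without revisiting old data, keeping memory bounded, and using the resulting vocabulary to flag domain-related text. The class `IncrementalVocabulary`, its parameters (`vocab_size`, `max_archive_terms`, `threshold`), the document-frequency scoring, and the frequency-based pruning are all hypothetical choices made for illustration, not the authors' method.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class IncrementalVocabulary:
    """Hypothetical sketch: the archive stores only term -> document-frequency
    counts, not the documents themselves (an assumed stand-in for DocLib)."""

    def __init__(self, vocab_size=5, max_archive_terms=50):
        self.vocab_size = vocab_size
        self.max_archive_terms = max_archive_terms
        self.doc_freq = Counter()  # compact footprint of the data seen so far
        self.docs_seen = 0

    def update(self, documents):
        """Fold a new batch into the archive without revisiting old documents."""
        for doc in documents:
            # Count each term once per document (document frequency).
            self.doc_freq.update(set(tokenize(doc)))
            self.docs_seen += 1
        # Bound memory: keep only the most frequent terms (assumed pruning rule).
        if len(self.doc_freq) > self.max_archive_terms:
            kept = self.doc_freq.most_common(self.max_archive_terms)
            self.doc_freq = Counter(dict(kept))

    def vocabulary(self):
        """Return the current vocabulary: the top terms by document frequency."""
        return {term for term, _ in self.doc_freq.most_common(self.vocab_size)}

    def is_domain_related(self, text, threshold=0.3):
        """Flag a document when enough of its tokens are vocabulary terms."""
        tokens = tokenize(text)
        if not tokens:
            return False
        vocab = self.vocabulary()
        hits = sum(1 for token in tokens if token in vocab)
        return hits / len(tokens) >= threshold


# Two incremental updates; the first batch is never re-read.
vocab = IncrementalVocabulary(vocab_size=4)
vocab.update(["neural network training", "neural network pruning"])
vocab.update(["network quantization", "neural architecture search"])
print(sorted(vocab.vocabulary()))
print(vocab.is_domain_related("neural network"))
```

A real system would at least filter stop words and weigh terms against a background corpus. The sketch omits both to keep the incremental update and the memory bound easy to see.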