
Deep Multilabel Multilingual Document Learning for Cross-Lingual Document Retrieval.

Authors

Feng Kai, Huang Lan, Xu Hao, Wang Kangping, Wei Wei, Zhang Rui

Affiliations

College of Computer Science and Technology, Jilin University, Changchun 130012, China.

School of International Economics and Trade, Changchun University of Finance and Economics, Changchun 130012, China.

Publication Information

Entropy (Basel). 2022 Jul 7;24(7):943. doi: 10.3390/e24070943.

Abstract

Cross-lingual document retrieval, which aims to use a query in one language to retrieve relevant documents in another, has attracted strong research interest over recent decades. Most studies of this task start with cross-lingual comparisons at the word level and then represent documents via word embeddings, which captures insufficient structural information. In this work, cross-lingual comparison at the document level is achieved through a cross-lingual semantic space. Our method, MDL (deep multilabel multilingual document learning), leverages a six-layer fully connected network to project cross-lingual documents into a shared semantic space. Once the documents are transformed into embeddings in this space, semantic distances between them can be calculated. Supervision signals are automatically extracted from the data and then used to construct the semantic space via a linear classifier. This avoids the ambiguity of manual labels and yields multilabel supervision signals rather than a single label. The multilabel supervision signals enrich the representation of the semantic space, which improves the discriminative ability of the embeddings. MDL is easy to extend to other fields since it does not depend on specific data. Furthermore, MDL is more efficient than models that train all languages jointly, since each language is trained individually. Experiments on Wikipedia data showed that the proposed method outperforms state-of-the-art cross-lingual document retrieval methods.
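The architecture described in the abstract can be sketched roughly as follows: a per-language six-layer fully connected encoder mapping a document vector into a shared semantic space, with a linear multilabel classifier head on top of the embedding. This is a minimal NumPy sketch under stated assumptions — the layer widths (256), embedding size (128), label count (10), tanh activations, and cosine distance are illustrative choices not specified in the abstract, and training (the multilabel classification loss that shapes the space) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


class MDLEncoder:
    """Hypothetical sketch of one per-language encoder: a six-layer
    fully connected network mapping a document vector into the shared
    semantic space, plus a linear multilabel classifier head.
    All dimensions are illustrative assumptions."""

    def __init__(self, in_dim, hidden=256, emb_dim=128, n_labels=10):
        # [in_dim, h, h, h, h, h, emb_dim] -> six weight matrices (six layers)
        sizes = [in_dim] + [hidden] * 5 + [emb_dim]
        self.weights = [rng.standard_normal((a, b)) * 0.05
                        for a, b in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(b) for b in sizes[1:]]
        # linear classifier over the embedding; supervision is multilabel,
        # so each label gets an independent sigmoid score
        self.cls_w = rng.standard_normal((emb_dim, n_labels)) * 0.05
        self.cls_b = np.zeros(n_labels)

    def embed(self, x):
        """Project a document vector into the shared semantic space."""
        for w, b in zip(self.weights, self.biases):
            x = np.tanh(x @ w + b)
        return x

    def label_scores(self, x):
        """Per-label probabilities from the linear multilabel head."""
        z = self.embed(x) @ self.cls_w + self.cls_b
        return 1.0 / (1.0 + np.exp(-z))


def cosine_distance(u, v):
    """Semantic distance between two embeddings in the shared space."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Usage sketch: each language has its own encoder (trained individually),
# but both map into the same embedding dimensionality, so a query in one
# language can be compared against documents in another.
enc_en = MDLEncoder(in_dim=300)   # e.g. English document features
enc_zh = MDLEncoder(in_dim=400)   # e.g. Chinese document features
query = enc_en.embed(rng.standard_normal(300))
doc = enc_zh.embed(rng.standard_normal(400))
dist = cosine_distance(query, doc)
```

Because both encoders emit embeddings of the same size, retrieval reduces to ranking candidate documents by `cosine_distance` to the query embedding; in the actual method the shared space is shaped by training each encoder against the automatically extracted multilabel supervision signals.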


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/883c/9318374/f9da8855dced/entropy-24-00943-g001.jpg
